Open alexrutar opened 4 days ago
I realize I removed exactly one normalization that was present before: `\u{2184}`, which was part of an entirely different block, 'Number Forms'. Since the previous block ends at `\u{209f}`, this made table 3 in the previous implementation unnecessarily large. If `\u{2184}` is badly missed it could be re-added (and I guess along with the rest of the 'Number Forms' block)?
This improves the normalization for Latin characters, mainly to address the concerns in #51. This adds a very large number of new normalizations, especially in the 'Latin Extended Additional' block, which for some reason was missing every capital letter.
I did not add normalizations in any new Unicode blocks, but I did slightly extend the 'Latin 1' block to also capture some of the subscripts; this is for consistency with the 'Subscripts and Superscripts' block which was previously handled. I also preserved the actual implementation of the `normalize` function in terms of the check order, etc. In particular, the generated code should be approximately the same. To verify this, I ran some crude benchmarks on a variety of inputs (all ASCII, sparse Unicode, heavy Unicode, all outside the normalization ranges) and there was no observable difference, though the benchmarks were definitely not super rigorous.

Finally, I inlined all of the char blocks, rather than relying on the 'sparse table' static generation which was implemented earlier. At least in my mind it is a bit easier to read in this form. It also makes it much clearer when characters are missed.
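To illustrate what "inlined char blocks" can look like, here is a minimal sketch. It is not the code in this PR: the function name `normalize_sketch` and the handful of mappings are assumptions chosen for illustration, and the real tables are far larger.

```rust
/// Illustrative only: a tiny, hand-picked subset of mappings,
/// not the table from this PR. `normalize_sketch` is a hypothetical name.
pub fn normalize_sketch(c: char) -> char {
    // Fast path: ASCII is returned unchanged.
    if c.is_ascii() {
        return c;
    }
    match c {
        // Latin-1 Supplement
        '\u{00C0}'..='\u{00C5}' => 'A', // À Á Â Ã Ä Å
        '\u{00C7}' => 'C',              // Ç
        '\u{00C8}'..='\u{00CB}' => 'E', // È É Ê Ë
        '\u{00E0}'..='\u{00E5}' => 'a', // à á â ã ä å
        '\u{00E7}' => 'c',              // ç
        '\u{00E8}'..='\u{00EB}' => 'e', // è é ê ë
        // Latin Extended Additional
        '\u{1E00}' => 'A', // Ḁ
        '\u{1E01}' => 'a', // ḁ
        // Superscripts and Subscripts: subscript digits ₀..₉
        '\u{2080}'..='\u{2089}' => (b'0' + (c as u32 - 0x2080) as u8) as char,
        // Everything else is left untouched.
        _ => c,
    }
}
```

With each block written out as explicit match arms, a missing character shows up as a visible gap between adjacent ranges, which is the readability benefit described above.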
If someone knows more about proper transliteration, I would be happy if they could take a peek through the transformations; I only applied the transliteration in cases where I was confident and hopefully did not make any controversial normalizations.
Two questions for discussion:

1. Is `chars::normalize` a reasonable name? Maybe it would be more precise to call it `chars::normalize_latin`. But I guess this is quite an annoying breaking change. The signature is the same, though, so it would be easy enough to include an alias and mark it as `#[deprecated]`.
2. Is `const fn` reasonable?
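For the naming question, a deprecated alias could look roughly like the sketch below. The `char -> char` signature, the placeholder body, and the wording of the deprecation note are assumptions for illustration. Both functions are declared `const fn` here only to show that a plain `match` on `char` ranges is compatible with it.

```rust
pub mod chars {
    /// Hypothetical stand-in for the renamed function; the real body
    /// would contain the full set of Latin mappings.
    pub const fn normalize_latin(c: char) -> char {
        match c {
            '\u{00E0}'..='\u{00E5}' => 'a', // à á â ã ä å
            _ => c,
        }
    }

    /// Old name kept as a thin alias so existing callers keep compiling,
    /// but get a deprecation warning pointing at the new name.
    #[deprecated(note = "renamed to `normalize_latin`")]
    pub const fn normalize(c: char) -> char {
        normalize_latin(c)
    }
}
```

Callers of `chars::normalize` would then see a compiler warning rather than a hard break, and the alias could be dropped in a later release.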