Add new normalization algorithms using Standardized Variants

sunfishcode commented 3 years ago

The standard normalization algorithms decompose CJK compatibility ideographs into codepoints that are nominally equivalent but traditionally look different. This is one of the main reasons normalization is considered destructive in practice.

Unicode 6.3 introduced a solution for this by providing standardized variation sequences for these codepoints. For example, U+2F8A6 "CJK COMPATIBILITY IDEOGRAPH-2F8A6" canonically decomposes to U+6148, which has a different appearance. In Unicode 6.3 and later, the standardized variation sequences in the StandardizedVariants.txt file include the following:

6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;

which says that "CJK COMPATIBILITY IDEOGRAPH-2F8A6" corresponds to U+6148 U+FE00, where U+FE00 is "VARIATION SELECTOR-1".
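To make the format concrete, here's a rough sketch of how one of these lines could be parsed. This is only an illustration, not the code in this PR: the names `parse_cjk_compat_variant` and `parse_hex_char` are made up, and it assumes the compatibility codepoint can be read off the end of the description field, as in the entry above.

```rust
/// Parse one line of StandardizedVariants.txt, keeping only entries that
/// describe CJK compatibility ideographs (hypothetical helper).
/// Returns (compatibility ideograph, base character, variation selector).
fn parse_cjk_compat_variant(line: &str) -> Option<(char, char, char)> {
    // Drop any trailing comment and surrounding whitespace.
    let line = line.split('#').next()?.trim();
    let mut fields = line.split(';');

    // First field: the variation sequence, e.g. "6148 FE00".
    let mut seq = fields.next()?.split_whitespace();
    let base = parse_hex_char(seq.next()?)?;
    let selector = parse_hex_char(seq.next()?)?;
    if seq.next().is_some() {
        return None; // expected exactly two codepoints
    }

    // Second field: the description, e.g. "CJK COMPATIBILITY IDEOGRAPH-2F8A6".
    // Entries with any other description are skipped.
    let desc = fields.next()?.trim();
    let hex = desc.strip_prefix("CJK COMPATIBILITY IDEOGRAPH-")?;
    let compat = parse_hex_char(hex)?;

    Some((compat, base, selector))
}

/// Parse a hexadecimal codepoint such as "6148" into a char.
fn parse_hex_char(s: &str) -> Option<char> {
    u32::from_str_radix(s, 16).ok().and_then(char::from_u32)
}

fn main() {
    let line = "6148 FE00; CJK COMPATIBILITY IDEOGRAPH-2F8A6;";
    let (compat, base, selector) = parse_cjk_compat_variant(line).unwrap();
    assert_eq!(compat, '\u{2F8A6}');
    assert_eq!(base, '\u{6148}');
    assert_eq!(selector, '\u{FE00}');

    // Comment-only lines yield nothing.
    assert_eq!(parse_cjk_compat_variant("# just a comment"), None);
}
```

Filtering on the description prefix means that entries in the file that aren't CJK compatibility ideographs are simply ignored.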

U+6148 and U+FE00 are both normalized codepoints, so we can transform text containing U+2F8A6 into normal form without losing information about the distinct appearance. At this time, many popular implementations ignore these variation selectors; however, this technique at least preserves the information in a standardized way, so implementations could use it if they chose.

~~This PR adds "ext" versions of the nfd, nfc, nfkd, and nfkc iterators, which perform the standard algorithms extended with this technique. They don't match the standard decompositions, and don't guarantee stability, but they do produce appropriately normalized output.~~

~~I used the generic term "ext" to reflect that other extensions could theoretically be added in the future. The standard decomposition tables are limited by their stability requirements, but these "ext" versions could be free to adopt new useful rules.~~

This PR adds a new svar() function which returns an iterator that performs this technique.
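For illustration, the general shape of such an iterator might look like the sketch below. This is a simplified stand-in rather than the actual implementation in this PR: `SvarSketch`, `svar_sketch`, and the one-entry `TABLE` are hypothetical names, and a real table would be generated from StandardizedVariants.txt.

```rust
/// Hypothetical one-entry table: (compatibility ideograph, base, selector).
/// A real table would be generated from StandardizedVariants.txt.
const TABLE: &[(char, char, char)] = &[('\u{2F8A6}', '\u{6148}', '\u{FE00}')];

/// Iterator adaptor that replaces each known compatibility ideograph with
/// its standardized variation sequence and passes other chars through.
struct SvarSketch<I: Iterator<Item = char>> {
    inner: I,
    // Variation selector still to be emitted after a base character.
    pending: Option<char>,
}

/// Wrap any char iterator in the adaptor (hypothetical constructor).
fn svar_sketch<I: Iterator<Item = char>>(inner: I) -> SvarSketch<I> {
    SvarSketch { inner, pending: None }
}

impl<I: Iterator<Item = char>> Iterator for SvarSketch<I> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // If the previous call emitted a base character, finish the
        // sequence by emitting its variation selector now.
        if let Some(selector) = self.pending.take() {
            return Some(selector);
        }
        let c = self.inner.next()?;
        match TABLE.iter().find(|&&(compat, _, _)| compat == c) {
            Some(&(_, base, selector)) => {
                self.pending = Some(selector);
                Some(base)
            }
            None => Some(c),
        }
    }
}

fn main() {
    let out: String = svar_sketch("a\u{2F8A6}b".chars()).collect();
    // The compatibility ideograph becomes base + VARIATION SELECTOR-1.
    assert_eq!(out, "a\u{6148}\u{FE00}b");
}
```

Since the output is just another iterator over chars, something like this could presumably be fed into the existing nfd/nfc iterators rather than needing separate variants of each.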

I'm not an expert in any of these topics, so please correct me if I'm mistaken in any of this. Also, I'm open to ideas about how to best present this functionality in the API.

Manishearth commented 3 years ago

A note: I don't have time to review this right now, but if someone else can that would be great. I'm not opposed to adding this.

(@sujayakar ?)

sunfishcode commented 3 years ago

Just a friendly ping, in case this got overlooked :slightly_smiling_face:

sujayakar commented 3 years ago

oh, thanks for the ping, looking at this now.

I've never seen the standardized variants before; this is pretty cool!

I see that the release notes mention the application of these variants to CJK normalization, but I don't see a reference to these variants in the documentation for normalization itself (or from googling around for "variation sequence normalization"). Do you know of any other normalization implementations that handle this?

Also, we're only taking the subset of StandardizedVariants.txt that pertains to CJK compatibility, right? So, we may want to name this process something related to CJK compatibility rather than standardized variants as a whole.

I'll do a close code review too, but it overall looks good.

sujayakar commented 3 years ago

okay overall looks good modulo those nits! thanks for the quick turnaround time on the changes.

unicode-rs / unicode-normalization

Add new normalization algorithms using Standardized Variants #70