unicode-rs / unicode-normalization

Unicode Normalization forms according to UAX#15 rules
https://unicode-rs.github.io/unicode-normalization

Unexpected decompose_canonical output for 0x0DDD #59

Closed RazrFalcon closed 4 years ago

RazrFalcon commented 4 years ago

Decomposing the 0x0DDD character:

Is this a unicode-normalization bug or am I using it wrong?

Reproduces on the stable release and on master.

Manishearth commented 4 years ago

These libraries are incorrect: they are not recursively normalizing the character. Please file bugs on them.

Uniview matches what we do.

RazrFalcon commented 4 years ago

But Uniview says: Character decomposition mapping: 0DDC 0DCA

Manishearth commented 4 years ago

@RazrFalcon yes, it decomposes twice: U+0DDD maps to U+0DDC U+0DCA, and U+0DDC then decomposes again.

I was talking about the NFD button on the text box, which does the right thing.
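For illustration, the two-step chain can be inspected with Python's standard unicodedata module. This is a sketch for readers of the thread, not code from the crate. The variable names are arbitrary, and the exact output depends on the Unicode data bundled with the Python build. With current data, the first lookup should match the Uniview mapping quoted above (0DDC 0DCA), and the NFD result should contain three code points.

```python
import unicodedata

# Direct (single-step) mappings as stored in the Unicode data tables.
first = unicodedata.decomposition("\u0ddd")   # mapping for U+0DDD
second = unicodedata.decomposition("\u0ddc")  # mapping for U+0DDC

# Full canonical decomposition (NFD), shown as hex code points.
nfd = [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", "\u0ddd")]

print("U+0DDD ->", first)
print("U+0DDC ->", second)
print("NFD(U+0DDD) ->", " ".join(nfd))
```

The first two lines of output are raw table entries. Only the last line is a normalization result.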

RazrFalcon commented 4 years ago

Is there a way to disable recursive normalization to get the results I'm looking for?

Manishearth commented 4 years ago

No. There is no such thing as non-recursive normalization; the normalization algorithm is recursive. What you want is the direct mapping from the Unicode data tables, which is intermediate data and not as useful on its own. This crate does not contain that data, since we handle the recursive bit in the script step.

Both hb_ucd_decompose and unicodedata.decomposition are data-table lookup APIs, intended primarily as building blocks for a proper decomposition algorithm. You shouldn't be using these APIs directly. What are you attempting to do?
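As a rough sketch of how a table lookup feeds a proper decomposition, the snippet below expands canonical mappings recursively using Python's unicodedata.decomposition. The function name full_canonical_decomposition is hypothetical and is not part of any library. It is deliberately incomplete: it handles a single character, it skips the canonical reordering of combining marks that NFD also requires, and it does not cover Hangul syllables, which are decomposed algorithmically rather than through table entries.

```python
import unicodedata


def full_canonical_decomposition(ch: str) -> str:
    """Recursively expand the canonical mapping of one character.

    Hypothetical helper for illustration only: no canonical reordering,
    no algorithmic Hangul decomposition.
    """
    mapping = unicodedata.decomposition(ch)
    # An empty string means there is no mapping. A leading "<tag>" marks a
    # compatibility mapping, which canonical decomposition must not follow.
    if not mapping or mapping.startswith("<"):
        return ch
    parts = [chr(int(cp, 16)) for cp in mapping.split()]
    # Each part may itself have a mapping, so recurse until none remain.
    return "".join(full_canonical_decomposition(p) for p in parts)


result = full_canonical_decomposition("\u0ddd")
print(" ".join(f"{ord(c):04X}" for c in result))
```

For U+0DDD, this should agree with unicodedata.normalize("NFD", "\u0ddd"), because the recursion follows the same second step through U+0DDC that a single lookup stops short of.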

RazrFalcon commented 4 years ago

I see. I understand now. Thanks for the help.