Unexpected decompose_canonical output for 0x0DDD

unicode-rs / unicode-normalization

Unicode Normalization forms according to UAX#15 rules

https://unicode-rs.github.io/unicode-normalization


Unexpected decompose_canonical output for 0x0DDD #59

Closed RazrFalcon closed 4 years ago

RazrFalcon commented 4 years ago

Decomposing the character 0x0DDD gives different results depending on the library:

decompose_canonical/decompose_compatible: 0x0DD9 0x0DCF 0x0DCA
harfbuzz::hb_ucd_decompose: 0x0DDC 0x0DCA
python unicodedata.decomposition: 0x0DDC 0x0DCA

Is this a unicode-normalization bug or am I using it wrong?

Reproduces on stable release and on master.

Manishearth commented 4 years ago

These libraries are incorrect: they are not decomposing the character recursively. Please file bugs against them.

Uniview matches what we do.

RazrFalcon commented 4 years ago

But Uniview says: Character decomposition mapping: 0DDC 0DCA

Manishearth commented 4 years ago

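@RazrFalcon Yes, it decomposes twice: U+0DDD maps to U+0DDC U+0DCA, and U+0DDC then decomposes again into U+0DD9 U+0DCF, which gives 0x0DD9 0x0DCF 0x0DCA overall.

I was talking about the NFD button on the text box, which does the right thing.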

RazrFalcon commented 4 years ago

Is there a way to disable recursive normalization to get the results I'm looking for?

Manishearth commented 4 years ago

No. There is no such thing as non-recursive normalization; the normalization algorithm is recursive by definition. What you want is the direct mapping from the Unicode data tables, which is intermediate data and not as useful on its own. This crate does not contain that data, since we handle the recursive expansion in the script that generates the tables.

Both hb_ucd_decompose and unicodedata.decomposition are data-table lookup APIs, intended mainly as building blocks for a proper decomposition algorithm. You shouldn't be using these APIs directly for normalization. What are you attempting to do?
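
To make the distinction concrete, here is a minimal Python sketch of how a single-level table lookup differs from the recursive canonical decomposition. It is an illustration only, not this crate's implementation: the helper names `direct_mapping`, `decompose_canonical_sketch`, and `hex_points` are made up for this example, and only `unicodedata.decomposition` and `unicodedata.normalize` are real standard-library calls. The sketch ignores Hangul syllables (which are decomposed algorithmically rather than via the table) and the canonical reordering of combining marks that full NFD also requires.

```python
import unicodedata


def direct_mapping(ch):
    """Single-level canonical mapping from the Unicode tables, or None.

    Hypothetical helper: a thin wrapper around unicodedata.decomposition().
    """
    fields = unicodedata.decomposition(ch).split()
    # An empty result means "no mapping"; a leading "<tag>" marks a
    # compatibility mapping, which canonical decomposition must skip.
    if not fields or fields[0].startswith("<"):
        return None
    return [chr(int(field, 16)) for field in fields]


def decompose_canonical_sketch(ch):
    """Apply the direct mapping repeatedly until nothing decomposes further.

    Assumption: no Hangul handling and no canonical reordering, so this is
    only the recursive-expansion part of NFD.
    """
    mapping = direct_mapping(ch)
    if mapping is None:
        return [ch]
    result = []
    for part in mapping:
        result.extend(decompose_canonical_sketch(part))
    return result


def hex_points(chars):
    """Format characters as space-separated uppercase code points."""
    return " ".join(f"{ord(c):04X}" for c in chars)


if __name__ == "__main__":
    ch = "\u0DDD"
    print("direct mapping:", hex_points(direct_mapping(ch)))
    print("recursive     :", hex_points(decompose_canonical_sketch(ch)))
    print("NFD           :", hex_points(unicodedata.normalize("NFD", ch)))
```

For U+0DDD, the direct mapping corresponds to the two-code-point result reported for `unicodedata.decomposition`, while the recursive expansion and `unicodedata.normalize("NFD", ...)` should both yield the three-code-point sequence that `decompose_canonical` returns.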

RazrFalcon commented 4 years ago

I see. I understand now. Thanks for the help.