unicode-org / unicodetools

home of unicodetools and https://util.unicode.org JSPs
https://util.unicode.org

Unicode 16 aux Script Charts bug: 2 Todhri characters in charts for several other scripts #922

Closed markusicu closed 3 months ago

markusicu commented 3 months ago

@dwanders-A reports: I’m checking the Auxiliary Charts and wanted to know why 2 Todhri characters appear at the end of Coptic in the script chart: https://www.unicode.org/charts/beta/script/index.html ? Whoa… Those two Todhri characters 105C9 105E4 show up at the end of Tai Le and Tifinagh too, so something is off.

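Even more charts are affected. Look for "105C9" on
https://www.unicode.org/charts/beta/script/chart_Coptic.html
https://www.unicode.org/charts/beta/script/chart_Tai_Le.html
https://www.unicode.org/charts/beta/script/chart_Tifinagh.html
https://www.unicode.org/charts/beta/script/chart_Hebrew.html
https://www.unicode.org/charts/beta/script/chart_Syriac.html
https://www.unicode.org/charts/beta/script/chart_Duployan.html
https://www.unicode.org/charts/beta/script/chart_Old_Permic.html
maybe more -- but not all of the script charts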

@macchiati ideas? Why these two?

UnicodeData.txt
105C9;TODHRI LETTER EI;Lo;0;L;105D2 0307;;;;N;;;;;
105E4;TODHRI LETTER U;Lo;0;L;105DA 0307;;;;N;;;;;

Scripts.txt
105C0..105F3  ; Todhri # Lo  [52] TODHRI LETTER A..TODHRI LETTER OO

Nothing in ScriptExtensions.txt

Ken-Whistler commented 3 months ago

Those are the only two Todhri letters with canonical decompositions, so I suspect an issue with normalization.

105C9 ; [.5237.0020.0002] # TODHRI LETTER EI
105D2 0307 ; [.5237.0020.0002] # TODHRI LETTER EI
105E4 ; [.5252.0020.0002] # TODHRI LETTER U
105DA 0307 ; [.5252.0020.0002] # TODHRI LETTER U
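For readers who want to check Ken's observation against the data quoted above, here is a minimal sketch that pulls the canonical decompositions out of the two UnicodeData.txt lines from this issue. The helper name `canonical_decomposition` is hypothetical; the lines are hard-coded rather than read from the file, since a given Python build's own `unicodedata` module may predate Unicode 16.

```python
# The two UnicodeData.txt lines quoted earlier in this issue.
UNICODE_DATA_LINES = [
    "105C9;TODHRI LETTER EI;Lo;0;L;105D2 0307;;;;N;;;;;",
    "105E4;TODHRI LETTER U;Lo;0;L;105DA 0307;;;;N;;;;;",
]


def canonical_decomposition(line):
    """Return (code point, list of decomposition code points) for one line.

    Field 5 of UnicodeData.txt holds the decomposition mapping. A mapping
    that starts with a <tag> is a compatibility mapping, so it is reported
    here as having no canonical decomposition.
    """
    fields = line.split(";")
    cp = int(fields[0], 16)
    mapping = fields[5]
    if not mapping or mapping.startswith("<"):
        return cp, []
    return cp, [int(part, 16) for part in mapping.split()]


decompositions = dict(canonical_decomposition(l) for l in UNICODE_DATA_LINES)

for cp, parts in decompositions.items():
    print(f"U+{cp:04X} -> " + " ".join(f"U+{p:04X}" for p in parts))
```

Both letters decompose to a Todhri base letter followed by U+0307, which is the only thing that distinguishes them from the other letters in the 105C0..105F3 range.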

eggrobin commented 3 months ago

Markus wrote:

Even more charts are affected. Look for "105C9" on
https://www.unicode.org/charts/beta/script/chart_Coptic.html
https://www.unicode.org/charts/beta/script/chart_Tai_Le.html
https://www.unicode.org/charts/beta/script/chart_Tifinagh.html
https://www.unicode.org/charts/beta/script/chart_Hebrew.html
https://www.unicode.org/charts/beta/script/chart_Syriac.html
https://www.unicode.org/charts/beta/script/chart_Duployan.html
https://www.unicode.org/charts/beta/script/chart_Old_Permic.html
maybe more -- but not all of the script charts

Ken noted that these have canonical decompositions with U+0307.

U+0307 used to be scx=Inherited, but it is now scx=Coptic|Duployan|Hebrew|Latin|Old_Permic|Syriac|Tai_Le|Tifinagh|Todhri.

The relevant logic is here: https://github.com/unicode-org/unicodetools/blob/4a3c96846703ac90d0e843274fe81b4c9ef76605/unicodetools/src/main/java/org/unicode/text/UCA/WriteCharts.java#L932-L952
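To make the failure mode concrete, here is a rough sketch of how a "union of Script_Extensions over the NFD components" rule reacts to the scx change for U+0307. This is an assumption about the general shape of the problem, not a transcription of the linked WriteCharts.java lines; the function `naive_chart_scripts` and the tables are hypothetical, with property values copied from this thread.

```python
# Property values copied from this issue, not read from the UCD.
SCRIPT = {0x105C9: "Todhri", 0x105D2: "Todhri", 0x0307: "Inherited"}
NFD = {0x105C9: [0x105D2, 0x0307]}

# Script_Extensions of U+0307 before and after the change described above.
SCX_OLD = {0x0307: {"Inherited"}}
SCX_NEW = {
    0x0307: {
        "Coptic", "Duployan", "Hebrew", "Latin", "Old_Permic",
        "Syriac", "Tai_Le", "Tifinagh", "Todhri",
    }
}


def naive_chart_scripts(cp, scx_table):
    """Union the scx of every NFD component, then drop Common/Inherited.

    Hypothetical stand-in for chart logic that treats all components of a
    decomposition alike, whether they are base letters or combining marks.
    """
    scripts = set()
    for part in NFD.get(cp, [cp]):
        # scx defaults to the Script value when no explicit entry exists.
        scripts |= scx_table.get(part, {SCRIPT[part]})
    return scripts - {"Common", "Inherited"}


before = naive_chart_scripts(0x105C9, SCX_OLD)
after = naive_chart_scripts(0x105C9, SCX_NEW)
print(sorted(before))
print(sorted(after))
```

Under these assumptions the letter is Todhri-only with the old data, and picks up every script in the new scx set of U+0307 with the new data, which matches the pattern of affected charts.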

eggrobin commented 3 months ago

(The fix would probably be, in the following code, to not look at the scx of Zinh components if we have an actual script from something else, but I will let someone who is not trying to be on vacation figure that out, and return to my cuneiform numbers.)

https://github.com/unicode-org/unicodetools/blob/4a3c96846703ac90d0e843274fe81b4c9ef76605/unicodetools/src/main/java/org/unicode/text/UCA/WriteCharts.java#L912-L926

macchiati commented 3 months ago

Right, the mixed script stuff shows up at the end, and in all the places that it occurs.

I agree with the direction Robin is headed. We should ignore the script of combining characters (not just Zinh), whenever they have a base character.

If implementations follow https://www.unicode.org/reports/tr24/#Nonspacing_Marks consistently, they should be OK. I suspect that assigning scx values to 0307 (etc.) that were formerly Common will end up biting implementations in unpredictable ways, as happened here: if I stumbled, then others probably will. On the other hand, it could nudge people to fix code that would hit the same problem in other cases.
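The rule proposed here (ignore the script of combining characters whenever they have a base character) could look roughly like the sketch below. It is an illustration under stated assumptions, not the change that was actually made to WriteCharts.java: `chart_scripts` and the tables are hypothetical, the property values are hard-coded from this thread, and "combining character" is approximated as General_Category M*.

```python
# Property values hard-coded from this thread; not read from the UCD.
GENERAL_CATEGORY = {0x105C9: "Lo", 0x105D2: "Lo", 0x0307: "Mn"}
SCRIPT = {0x105C9: "Todhri", 0x105D2: "Todhri", 0x0307: "Inherited"}
SCRIPT_EXTENSIONS = {
    0x0307: {
        "Coptic", "Duployan", "Hebrew", "Latin", "Old_Permic",
        "Syriac", "Tai_Le", "Tifinagh", "Todhri",
    }
}
NFD = {0x105C9: [0x105D2, 0x0307]}


def is_mark(cp):
    """Approximate 'combining character' as General_Category Mn/Mc/Me."""
    return GENERAL_CATEGORY[cp].startswith("M")


def chart_scripts(cp):
    """Scripts whose chart should list cp, ignoring marks that have a base."""
    parts = NFD.get(cp, [cp])
    bases = [p for p in parts if not is_mark(p)]
    # Marks only contribute when there is no base character to decide.
    relevant = bases if bases else parts
    scripts = set()
    for part in relevant:
        scripts |= SCRIPT_EXTENSIONS.get(part, {SCRIPT[part]})
    return scripts - {"Common", "Inherited"}


print(sorted(chart_scripts(0x105C9)))
print(sorted(chart_scripts(0x0307)))
```

With this rule U+105C9 stays in the Todhri chart only, while a lone U+0307 is still listed under every script in its own scx set. Filtering on all marks rather than only on Zinh components follows the broader suggestion in this comment.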

On Tue, Aug 20, 2024 at 4:57 PM Robin Leroy @.***> wrote:

(The fix would probably be, in the following code, to not look at the scx of Zinh components if we have an actual script from something else, but I will let someone who is not trying to be on vacation figure that out, and return to my cuneiform numbers.)

https://github.com/unicode-org/unicodetools/blob/4a3c96846703ac90d0e843274fe81b4c9ef76605/unicodetools/src/main/java/org/unicode/text/UCA/WriteCharts.java#L912-L926


markusicu commented 3 months ago

working on it

macchiati commented 3 months ago

Thanks!

On Wed, Aug 21, 2024 at 9:46 AM Markus Scherer @.***> wrote:

working on it
