markusicu closed this 3 months ago
Those are the only two Todhri letters with canonical decompositions, so I suspect an issue with normalization.
105C9 ; [.5237.0020.0002] # TODHRI LETTER EI
105D2 0307 ; [.5237.0020.0002] # TODHRI LETTER EI
105E4 ; [.5252.0020.0002] # TODHRI LETTER U
105DA 0307 ; [.5252.0020.0002] # TODHRI LETTER U
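To make the pairing explicit, here is a minimal sketch that parses the four lines quoted above and groups the code point sequences by their collation elements. The `parse` helper and the `by_key` name are made up for this illustration; this is not how unicodetools reads its data.

```python
from collections import defaultdict

# The four lines quoted in this comment, copied verbatim.
LINES = """\
105C9 ; [.5237.0020.0002] # TODHRI LETTER EI
105D2 0307 ; [.5237.0020.0002] # TODHRI LETTER EI
105E4 ; [.5252.0020.0002] # TODHRI LETTER U
105DA 0307 ; [.5252.0020.0002] # TODHRI LETTER U
""".splitlines()


def parse(line):
    """Split 'codes ; elements # comment' into (code points, elements, comment)."""
    data, _, comment = line.partition("#")
    codes, _, elements = data.partition(";")
    code_points = tuple(int(c, 16) for c in codes.split())
    return code_points, elements.strip(), comment.strip()


# Group the code point sequences that share the same collation elements.
by_key = defaultdict(list)
for line in LINES:
    code_points, elements, _ = parse(line)
    by_key[elements].append(code_points)

for elements, sequences in by_key.items():
    formatted = [" ".join(f"{cp:04X}" for cp in seq) for seq in sequences]
    print(elements, "<-", formatted)
```

Each precomposed letter ends up in the same group as its base-plus-U+0307 sequence, so anything that looks at the decomposed form also sees the combining dot.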
Markus wrote:
Even more charts are affected. Look for "105C9" on
https://www.unicode.org/charts/beta/script/chart_Coptic.html
https://www.unicode.org/charts/beta/script/chart_Tai_Le.html
https://www.unicode.org/charts/beta/script/chart_Tifinagh.html
https://www.unicode.org/charts/beta/script/chart_Hebrew.html
https://www.unicode.org/charts/beta/script/chart_Syriac.html
https://www.unicode.org/charts/beta/script/chart_Duployan.html
https://www.unicode.org/charts/beta/script/chart_Old_Permic.html
maybe more -- but not all of the script charts
Ken noted that these have canonical decompositions with U+0307.
U+0307 used to be scx=Inherited, but it is now scx=Coptic|Duployan|Hebrew|Latin|Old_Permic|Syriac|Tai_Le|Tifinagh|Todhri.
The relevant logic is here: https://github.com/unicode-org/unicodetools/blob/4a3c96846703ac90d0e843274fe81b4c9ef76605/unicodetools/src/main/java/org/unicode/text/UCA/WriteCharts.java#L932-L952
(The fix would probably be, in the following code, to not look at the scx of Zinh components if we have an actual script from something else, but I will let someone who is not trying to be on vacation figure that out, and return to my cuneiform numbers.) https://github.com/unicode-org/unicodetools/blob/4a3c96846703ac90d0e843274fe81b4c9ef76605/unicodetools/src/main/java/org/unicode/text/UCA/WriteCharts.java#L912-L926
Right, the mixed script stuff shows up at the end, and in all the places that it occurs.
I agree with the direction Robin is headed. We should ignore the script of combining characters (not just Zinh), whenever they have a base character.
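A rough sketch of that rule, assuming a tiny hand-written script table that holds only the values mentioned in this thread. The function names and the table are hypothetical, and this is not the actual WriteCharts.java logic, only an illustration of "ignore marks when a base is present".

```python
import unicodedata

# Hypothetical, hand-copied data for illustration only. Real code would read
# Scripts.txt and ScriptExtensions.txt instead of hard-coding values.
SCRIPT_EXTENSIONS = {
    0x0307: {"Coptic", "Duployan", "Hebrew", "Latin", "Old_Permic",
             "Syriac", "Tai_Le", "Tifinagh", "Todhri"},
    0x105D2: {"Todhri"},
    0x105DA: {"Todhri"},
}


def is_mark(cp):
    """True for combining marks (General_Category Mn, Mc, or Me)."""
    return unicodedata.category(chr(cp)).startswith("M")


def naive_scripts(sequence):
    """Union of the script extensions of every code point in the sequence."""
    result = set()
    for cp in sequence:
        result |= SCRIPT_EXTENSIONS.get(cp, set())
    return result


def base_scripts(sequence):
    """Like naive_scripts, but ignore marks whenever a base is present."""
    bases = [cp for cp in sequence if not is_mark(cp)]
    # A sequence made only of marks keeps its own script extensions.
    return naive_scripts(bases or sequence)


decomposed_ei = [0x105D2, 0x0307]
print(sorted(naive_scripts(decomposed_ei)))  # includes Coptic, Hebrew, ...
print(sorted(base_scripts(decomposed_ei)))   # only Todhri
```

With the naive union, the decomposed letter picks up every script listed for U+0307, which is at least consistent with the stray chart entries reported here. Restricting to base characters leaves only Todhri, and a lone combining mark still reports its own script extensions.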
If implementations follow https://www.unicode.org/reports/tr24/#Nonspacing_Marks consistently, they should be OK. I suspect that assigning scx values to 0307 (etc.) that were formerly Common will end up biting implementations in unpredictable ways, as happened here: if I stumbled, then others probably will. On the other hand, it could nudge people to fix code that would hit the same problem in other cases.
On Tue, Aug 20, 2024 at 4:57 PM Robin Leroy @.***> wrote:
(The fix would probably be, in the following code, to not look at the scx of Zinh components if we have an actual script from something else, but I will let someone who is not trying to be on vacation figure that out, and return to my cuneiform numbers.)
working on it
Thanks!
On Wed, Aug 21, 2024 at 9:46 AM Markus Scherer @.***> wrote:
working on it
@dwanders-A reports: I’m checking the Auxiliary Charts and wanted to know why 2 Todhri characters appear at the end of Coptic in the script chart: https://www.unicode.org/charts/beta/script/index.html ? Whoa… Those two Todhri characters, 105C9 and 105E4, show up at the end of Tai Le and Tifinagh too, so something is off.
@macchiati ideas? Why these two?