unicode-org / unicodetools

home of unicodetools and https://util.unicode.org JSPs

make CLDR radical-stroke order = UAX38 #909

Closed by markusicu 3 months ago

markusicu commented 3 months ago

Make the CLDR radical-stroke order of CJK ideographs match the order in UAX38 section 2.1.2, "Sorting Algorithm Used by the Radical-Stroke Indexes".

For

Changes:

Other code changes

The modified output goes into CLDR:

The full radical-stroke order is printed into the FractionalUCA.txt file there. (file format documentation) See the CLDR PR for the diffs.

This change also affects some of the CLDR Chinese tailoring data.
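
As a rough illustration of the ordering this PR aims for, here is a minimal Java sketch of a comparator. It is not the code in RadicalStroke.java. The class, field, and method names are hypothetical, and the field order (radical, residual strokes, simplification level, block, code point) is an assumption inferred from the sample diffs in this description. The sort key in UAX38 section 2.1.2 may define further fields that are not modeled here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RadicalStrokeOrderSketch {
    /** Hypothetical holder for one ideograph and the fields of its primary kRSUnicode value. */
    public static final class Entry {
        final int codePoint;
        final int radical;          // Kangxi radical number
        final int simplification;   // number of apostrophes after the radical number
        final int residualStrokes;

        public Entry(int codePoint, int radical, int simplification, int residualStrokes) {
            this.codePoint = codePoint;
            this.radical = radical;
            this.simplification = simplification;
            this.residualStrokes = residualStrokes;
        }
    }

    // Unified ideograph blocks, ranked in the order the sample diffs suggest
    // (an assumption): URO, Ext A, B, C, D, E, F, G, H, I.
    private static final int[][] BLOCKS = {
        {0x4E00, 0x9FFF}, {0x3400, 0x4DBF}, {0x20000, 0x2A6DF}, {0x2A700, 0x2B73F},
        {0x2B740, 0x2B81F}, {0x2B820, 0x2CEAF}, {0x2CEB0, 0x2EBEF},
        {0x30000, 0x3134F}, {0x31350, 0x323AF}, {0x2EBF0, 0x2EE5F},
    };

    /** Returns the rank of the block containing codePoint; unknown blocks sort last. */
    public static int blockRank(int codePoint) {
        for (int i = 0; i < BLOCKS.length; i++) {
            if (codePoint >= BLOCKS[i][0] && codePoint <= BLOCKS[i][1]) {
                return i;
            }
        }
        return BLOCKS.length;
    }

    // Radical first, then residual strokes, then simplification level,
    // then block rank, then code point as the final tie-breaker.
    public static final Comparator<Entry> ORDER =
            Comparator.comparingInt((Entry e) -> e.radical)
                    .thenComparingInt(e -> e.residualStrokes)
                    .thenComparingInt(e -> e.simplification)
                    .thenComparingInt(e -> blockRank(e.codePoint))
                    .thenComparingInt(e -> e.codePoint);

    /** Sorts a copy of the entries and returns their code points as space-separated hex. */
    public static String sortedHex(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(ORDER);
        StringBuilder sb = new StringBuilder();
        for (Entry e : copy) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", e.codePoint));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Entry> input = new ArrayList<>();
        input.add(new Entry(0x9876, 181, 1, 2)); // 顶 181'.2
        input.add(new Entry(0x9802, 181, 0, 2)); // 頂 181.2
        input.add(new Entry(0x9875, 181, 1, 0)); // 页 181'.0
        input.add(new Entry(0x9801, 181, 0, 0)); // 頁 181.0
        System.out.println(sortedHex(input)); // prints 9801 9875 9802 9876
    }
}
```

With this key, simplified-radical characters interleave with traditional-radical ones at the same residual stroke count (頁, 页, 頂, 顶), which is the relative order these four characters have in the new radical 181 line of the sample diff. The block rank, rather than the raw code point, is what would place Extension G characters before Extension I ones.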

Sample FractionalUCA [radical] data diffs:

-[radical 1=⼀一:一𪛙丁-丆𠀀-𠀂𬺰𰀀万-丌亐卄𠀃-𠀆𪛚𪜀𪜁𫝀𬺱-𬺴𰀁-𰀄不-专丗𠀇-𠀌𪜂𫠡𬺵-𬺹𮯰𰀅-𰀇且-世丘-丝㐀𠀍-𠀗𫠢𫠣𬺺-𬺾𰀈-𰀊𱍐丞-丢㐁㐂𠀘-𠀚𠀜𠀞-𠀠𫝁𫠤𫠥𬺿-𬻉𰀋𱍑丣-严丽鿖𠀡-𠀤𠀦-𠀨𠀪𠀫𫝂𫠦-𫠩𬻊-𬻒𰀌𱍒並丧𠀬-𠀮𠀰-𠀴𪜃𫠪-𫠭𬻓-𬻘𰀍𱍓-𱍗鿗𠀵𠀶𠀸𠀺𠀻𪜄𫠮𬻙-𬻝𰀎-𰀑𠀽-𠁀𠤢𪜅𫠯-𫠲𬻞-𬻠𰀒-𰀕𱍘-𱍝𠁁-𠁅𪜆𫠳-𫠵𬻡-𬻥𱍞𱍟𠁆-𠁈𠁊𠁋𫠶𬻦-𬻨𰀖-𰀘𱍠𱍡𠁌𠁍𫠷-𫠼𬻩-𬻮𰀙𰀚𱍢-𱍤𠁎-𠁒𫝃𫠽𬻯𰀛𰀜𱍥䶶𠁓𠁔𫠾𫠿𬻰𰀝𱍦𱍧𠁕𠁗-𠁛𠁝𤳏𪜇𫡀𱍨𠁖𰀞𱍩𠁟𫡁𫡂𠁠𰀟𬻱𱍪]
+[radical 1=⼀一:一𪛙丁-丆𠀀-𠀂𬺰𰀀万-丌亐卄𠀃-𠀆𪛚𪜀𪜁𫝀𬺱-𬺴𰀁-𰀄不-专丗𠀇-𠀌𪜂𫠡𬺵-𬺹𰀅-𰀇𮯰且-世丘-丝㐀𠀍-𠀗𫠢𫠣𬺺-𬺾𰀈-𰀊𱍐丞-丢㐁㐂𠀘-𠀚𠀜𠀞-𠀠𫝁𫠤𫠥𬺿-𬻉𰀋𱍑丣-严丽鿖𠀡-𠀤𠀦-𠀨𠀪𠀫𫝂𫠦-𫠩𬻊-𬻒𰀌𱍒並丧𠀬-𠀮𠀰-𠀴𪜃𫠪-𫠭𬻓-𬻘𰀍𱍓-𱍗鿗𠀵𠀶𠀸𠀺𠀻𪜄𫠮𬻙-𬻝𰀎-𰀑𠀽-𠁀𠤢𪜅𫠯-𫠲𬻞-𬻠𰀒-𰀕𱍘-𱍝𠁁-𠁅𪜆𫠳-𫠵𬻡-𬻥𱍞𱍟𠁆-𠁈𠁊𠁋𫠶𬻦-𬻨𰀖-𰀘𱍠𱍡𠁌𠁍𫠷-𫠼𬻩-𬻮𰀙𰀚𱍢-𱍤𠁎-𠁒𫝃𫠽𬻯𰀛𰀜𱍥䶶𠁓𠁔𫠾𫠿𬻰𰀝𱍦𱍧𠁕𠁗-𠁛𠁝𤳏𪜇𫡀𱍨𠁖𰀞𱍩𠁟𫡁𫡂𠁠𰀟𬻱𱍪]
...
 [radical 179=⾲韭:韭韮䪞𩐁𩐂𲊦𱂍韯䪟𩐃韰𩐄韱䪠𩐅-𩐈韲䪡䪢𩐉𩐊䪣𩐋𩐍𩐎𱂎䪤𩐌𩐏-𩐓䪥𩐔-𩐖]
-[radical 180=⾳音:音竟章䪦-䪨𩐗𮧶𮧷𮸱𱂏韴韵䪩𩐘𩐙𫖗𮸲𮸳韶韷䪪𩐚-𩐝𫖘𬰹-𬰻𮧸𩐞-𩐦𬰼𮧹𮧺𲊧韸䪫䪬𩐧-𩐬𬰽𮧻𱂐𱂑𩐭-𩐰𲊨韹韺䪭𩐱-𩐴𫖙𮧼𱂒𱂓𲊩韻韼䪮䪯𩐵-𩐸𮧽韽-響𩐹-𩐾𫖚𩐿-𩑁𫖛𮧾䪰𩑂-𩑆𮧿頀𩑇𩑈𫖜𬰾𩑉𩑊]
-[radical 181=⾴頁:頁𩑋頂-頄𩑌-𩑏𬰿項-頉䪱䪲𩑐-𩑘𬱀頊-頓䪳-䪵𩑙-𩑯𫖝𮨀-𮨂𱂔𱂕頔-頚䪶-䪾𩑰-𩒎𫖞𬱁𬱂𮨃-𮨆𲊪-𲊬頛-頣頦-頬䪿-䫂𩒏-𩒭𬱃𮨇-𮨊𱂖𱂗𲊭𲊮頤頥頭-頽䫃-䫊𩒮-𩓜𫖟𫖠𬱄-𬱇𮨋𮨌𱂘𱂙𲊯頿-顊䫋-䫓𩓝-𩓿𫖡𬱈𬱉𮨍-𮨔𲊰-𲊲頾顋-顕䫔-䫝𩔀-𩔘𫖢𫖣𬱊𬱋𮨕𮨖𲊳𲊴顖-類䫞-䫧𩔙-𩔲𫖤𮨗-𮨛𱂚-𱂜𲊵𲊶顟-顣䫨-䫫𩔳-𩕈𫖥𫖦𬱌𬱍𮨜𮨝𲊷顤-顨䫬-䫱𩕉-𩕞𫖧𬱎𮨞𮨟𱂝𲊸顩-顫䫲-䫴𩕟-𩕫𫖨𬱏𮨠𮨡顬-顯𩕬-𩕽𱂞顰䫵䫶𩕾-𩖅𫖩𬱐𮨢𮨣顱顲䫷𩖆-𩖈𮨤𮨥𩖉-𩖎𬱑𱂟顳顴𩖏-𩖓𬱒]
-[radical 181'=⻚页:页-顷𬱓顸-须𫖪𮸴𱂠𲊹顼-预𫖫𫠆𬱔𬱕𱂡颅-颈𫖬𫖭𬱖-𬱚𱂢𲊺𲊻颉-颏𫖮-𫖱𬱛-𬱢𱂣-𱂨颐-颖𫖲𫖳𬱣-𬱥𱂩-𱂬𲊼𲊽颗𩖕𩖖𫖴-𫖶𬱦-𬱬𮸵𱂭-𱂰题-额𫖷𬱭-𬱯𮸶-𮸸𱂱-𱂳𲊾-𲋀颞-颡𫖸𬱰𱂴-𱂹𲋁𫖹𱂺颢颣𬱱𮸹𮸺𱂻𲋂颤𩖗𲋃颥𬱲颦𫖺颧𬱳]
-[radical 182=⾵風:風䫸𩖘𩖙𮨦颩颪䫹𩖚-𩖡颫颬䫺-䫽𩖢-𩖯𩖱-𩖳𫖻𮨧𱂼颭-颱䫾-䬃𩖴-𩗃𮨨𱂽-𱂿颲颳䬄䬅𠙬𩗄-𩗒𮨩-𮨫颴颵䬆-䬊𩗓-𩗧𮨬𱃀-𱃂𲋅颶颷䬋-䬐𩗨-𩘄𮨭-𮨯𱃃-𱃆颸-颺䬑-䬗𩘅-𩘍𩘏-𩘛𬱴𱃇-𱃉𲋆颻-飀䬘-䬚𩘎𩘜-𩘬𮨰𱃊飁-飄䬛䬜𩘭-𩘷𮨱𱃋-𱃍飅-飊䬝𩘸-𩙇飋𩙈-𩙋𩙍𮨲𱃎𱃏𲋉𲋊䬞𩙎-𩙐𫗅𲋌䬟𩙑-𩙕𫗆𱃐𲋍𩙖-𩙜𱃑飌飍𩙝𩙞𱃒𱃓𩙟𮨳𩙠-𩙤]
-[radical 182'=⻛风:风飏𱃔𫗇𫠇𬱵𬱷𱃕𲋎𲋏飐-飒𩙥𩙦𫠈𬱸𬱺𱃖𱃗𲋐𬱼𱃘-𱃚𩙧𫗈𬱽𱃛飓𩙨-𩙪𫗉𬱾-𬲀𮨴𮸻𱃜𱃝𲋑飔飖𩙫𩙬𫗊𱃞飕飗𩙭𩙮飘𮨵𱃟飙飚𩙯𬲅𬲆𮸼𱃠𩙰𫗋]
-[radical 182''=𲋄:𲋄𬱶𫖼𬱹𬱻𫖽-𫖿𬲁𬲂𲋇𲋈𫗀-𫗂𬲃𬲄𩙌𫗃𫗄𬲇𲋋𬲈]
+[radical 180=⾳音:音竟章䪦-䪨𩐗𮧶𮧷𱂏𮸱韴韵䪩𩐘𩐙𫖗𮸲𮸳韶韷䪪𩐚-𩐝𫖘𬰹-𬰻𮧸𩐞-𩐦𬰼𮧹𮧺𲊧韸䪫䪬𩐧-𩐬𬰽𮧻𱂐𱂑𩐭-𩐰𲊨韹韺䪭𩐱-𩐴𫖙𮧼𱂒𱂓𲊩韻韼䪮䪯𩐵-𩐸𮧽韽-響𩐹-𩐾𫖚𩐿-𩑁𫖛𮧾䪰𩑂-𩑆𮧿頀𩑇𩑈𫖜𬰾𩑉𩑊]
+[radical 181=⾴頁⻚页:頁𩑋页頂-頄𩑌-𩑏𬰿顶顷𬱓項-頉䪱䪲𩑐-𩑘𬱀顸-须𫖪𱂠𲊹𮸴頊-頓䪳-䪵𩑙-𩑯𫖝𮨀-𮨂𱂔𱂕顼-预𫖫𫠆𬱔𬱕𱂡頔-頚䪶-䪾𩑰-𩒎𫖞𬱁𬱂𮨃-𮨆𲊪-𲊬颅-颈𫖬𫖭𬱖-𬱚𱂢𲊺𲊻頛-頣頦-頬䪿-䫂𩒏-𩒭𬱃𮨇-𮨊𱂖𱂗𲊭𲊮颉-颏𫖮-𫖱𬱛-𬱢𱂣-𱂨頤頥頭-頽䫃-䫊𩒮-𩓜𫖟𫖠𬱄-𬱇𮨋𮨌𱂘𱂙𲊯颐-颖𫖲𫖳𬱣-𬱥𱂩-𱂬𲊼𲊽頿-顊䫋-䫓𩓝-𩓿𫖡𬱈𬱉𮨍-𮨔𲊰-𲊲颗𩖕𩖖𫖴-𫖶𬱦-𬱬𱂭-𱂰𮸵頾顋-顕䫔-䫝𩔀-𩔘𫖢𫖣𬱊𬱋𮨕𮨖𲊳𲊴题-额𫖷𬱭-𬱯𱂱-𱂳𲊾-𲋀𮸶-𮸸顖-類䫞-䫧𩔙-𩔲𫖤𮨗-𮨛𱂚-𱂜𲊵𲊶颞-颡𫖸𬱰𱂴-𱂹𲋁顟-顣䫨-䫫𩔳-𩕈𫖥𫖦𬱌𬱍𮨜𮨝𲊷𫖹𱂺顤-顨䫬-䫱𩕉-𩕞𫖧𬱎𮨞𮨟𱂝𲊸颢颣𬱱𱂻𲋂𮸹𮸺顩-顫䫲-䫴𩕟-𩕫𫖨𬱏𮨠𮨡颤𩖗𲋃顬-顯𩕬-𩕽𱂞颥𬱲顰䫵䫶𩕾-𩖅𫖩𬱐𮨢𮨣颦𫖺顱顲䫷𩖆-𩖈𮨤𮨥𩖉-𩖎𬱑𱂟颧𬱳顳顴𩖏-𩖓𬱒]
+[radical 182=⾵風⻛风𲋄:風风𲋄䫸𩖘𩖙𮨦颩颪䫹𩖚-𩖡飏𱃔颫颬䫺-䫽𩖢-𩖯𩖱-𩖳𫖻𮨧𱂼𫗇𫠇𬱵𬱷𱃕𲋎𲋏𬱶颭-颱䫾-䬃𩖴-𩗃𮨨𱂽-𱂿飐-飒𩙥𩙦𫠈𬱸𬱺𱃖𱃗𲋐𫖼𬱹𬱻颲颳䬄䬅𠙬𩗄-𩗒𮨩-𮨫𬱼𱃘-𱃚颴颵䬆-䬊𩗓-𩗧𮨬𱃀-𱃂𲋅𩙧𫗈𬱽𱃛颶颷䬋-䬐𩗨-𩘄𮨭-𮨯𱃃-𱃆飓𩙨-𩙪𫗉𬱾-𬲀𮨴𱃜𱃝𲋑𮸻𫖽颸-颺䬑-䬗𩘅-𩘍𩘏-𩘛𬱴𱃇-𱃉𲋆飔飖𩙫𩙬𫗊𱃞𫖾𫖿𬲁𬲂𲋇𲋈颻-飀䬘-䬚𩘎𩘜-𩘬𮨰𱃊飕飗𩙭𩙮𫗀-𫗂𬲃𬲄飁-飄䬛䬜𩘭-𩘷𮨱𱃋-𱃍飘𮨵𱃟飅-飊䬝𩘸-𩙇飙飚𩙯𬲅𬲆𱃠𮸼飋𩙈-𩙋𩙍𮨲𱃎𱃏𲋉𲋊𩙰𫗋𩙌𫗃𫗄𬲇𲋋䬞𩙎-𩙐𫗅𲋌䬟𩙑-𩙕𫗆𱃐𲋍𩙖-𩙚𬲈𩙛𩙜𱃑飌飍𩙝𩙞𱃒𱃓𩙟𮨳𩙠-𩙤]
markusicu commented 3 months ago

> It is unclear whether this change accommodates a second non-Chinese simplified radical, which is expressed using three apostrophes, and which is new for Unicode Version 16.0. See the Proposed Update of UAX #38.

I did that three months ago in one of the radical-and-simplified parsers:

The refactored parser here, at the end of RadicalStroke.java, handles up to three apostrophes, quoting the 16.0 proposed.html version of UAX38.

markusicu commented 3 months ago

PS: There is no character with a three-apostrophe radical as the primary kRSUnicode value. There are only two characters where such a radical is in the secondary value. This code only uses the primary value. That's why I had missed updating that version of the parser before -- it never saw a triple apostrophe.

kenlunde commented 3 months ago

Ah, good. For reference, draft code point U+3347B in the Extension J block (Unicode Version 17.0) will have 212'''.4 as its primary kRSUnicode property value.
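
For readers following along, here is a minimal Java sketch of parsing only the primary kRSUnicode value with up to three apostrophes. It is not the parser in RadicalStroke.java. The class and method names are hypothetical, and the regular expression is an assumed reading of the kRSUnicode value syntax that should be checked against UAX38.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RSUnicodeSketch {
    // Assumed syntax of a single value: radical number, zero to three
    // apostrophes, a dot, then a (possibly negative) residual stroke count.
    private static final Pattern VALUE =
            Pattern.compile("([1-9][0-9]{0,2})('{0,3})\\.(-?[0-9]{1,2})");

    /**
     * Parses the first (primary) value of a space-separated kRSUnicode string.
     * Returns {radical, simplification level, residual strokes}, where the
     * simplification level is the number of apostrophes.
     */
    public static int[] parsePrimary(String kRSUnicode) {
        String primary = kRSUnicode.trim().split("\\s+")[0];
        Matcher m = VALUE.matcher(primary);
        if (!m.matches()) {
            throw new IllegalArgumentException("unexpected kRSUnicode value: " + primary);
        }
        return new int[] {
            Integer.parseInt(m.group(1)),
            m.group(2).length(),
            Integer.parseInt(m.group(3)),
        };
    }

    public static void main(String[] args) {
        // Value quoted in this thread for draft code point U+3347B.
        System.out.println(Arrays.toString(parsePrimary("212'''.4"))); // prints [212, 3, 4]
        // Made-up two-value string: only the first value is used.
        System.out.println(Arrays.toString(parsePrimary("1.0 2'''.1"))); // prints [1, 0, 0]
    }
}
```

Because only the first value is read, a triple apostrophe that appears solely in a secondary value never reaches the parser, which matches the explanation above of why the older parser version was not updated earlier.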

markusicu commented 3 months ago

FYI: The output CLDR files are now in https://github.com/unicode-org/cldr/pull/3960. I have updated the description of this PR here with a link for that as well.

markusicu commented 3 months ago

Could I get a review please? Or maybe a rubber stamp? @macchiati? @pedberg-icu approved the output in CLDR, maybe you could look at the generator changes here?

macchiati commented 3 months ago

done

On Thu, Aug 15, 2024 at 10:41 AM Markus Scherer wrote:

> Could I get a review please? Or maybe a rubber stamp? @macchiati? @pedberg-icu approved the output in CLDR, maybe you could look at the generator changes here?


markusicu commented 3 months ago

> done

tnx!