hexawyz / NetUnicodeInfo

Unicode Character Inspector & Library providing a subset of the Unicode data for .NET clients.
https://www.nuget.org/packages/UnicodeInformation/
MIT License

Bug: Handle Non Chinese simplified form in CJKRadicals-15.1.0.txt #10

Open russcam opened 4 months ago

russcam commented 4 months ago

CJKRadicals-15.1.0.txt uses apostrophes after the radical number to indicate that the ideograph uses a standard simplification. From Unicode® Standard Annex #38 UNICODE HAN DATABASE (UNIHAN):

A single apostrophe indicates the Chinese simplified form of the radical (for example, U+9F7F 齿 for U+9F52 齒) and two apostrophes indicate the non-Chinese simplified form of the radical (for example, U+6B6F 歯 for U+9F52 齒).
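As a rough illustration of the notation quoted above (a Python sketch, not the library's C# code), the radical field can be split into a number and a form by counting trailing apostrophes. The names `RadicalForm` and `parse_radical_field` are hypothetical and do not exist in NetUnicodeInfo.

```python
import re
from enum import Enum


class RadicalForm(Enum):
    """Form of a radical, keyed by the number of trailing apostrophes (hypothetical type)."""
    TRADITIONAL = 0
    CHINESE_SIMPLIFIED = 1
    NON_CHINESE_SIMPLIFIED = 2


# One or more digits followed by zero, one, or two apostrophes.
_RADICAL_FIELD = re.compile(r"([0-9]+)('{0,2})")


def parse_radical_field(field: str) -> tuple[int, RadicalForm]:
    """Split a radical field such as "211''" into (211, NON_CHINESE_SIMPLIFIED)."""
    match = _RADICAL_FIELD.fullmatch(field.strip())
    if match is None:
        raise ValueError(f"Unrecognized radical field: {field!r}")
    return int(match.group(1)), RadicalForm(len(match.group(2)))
```

With this shape, the two-apostrophe case becomes one more enum value rather than an error path, and anything with three or more apostrophes is still rejected.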

The ProcessCjkRadicalsFile method handles the single-apostrophe case, but throws on the two-apostrophe case at:

https://github.com/hexawyz/NetUnicodeInfo/blob/16ae6bc248cc10c02d3f200a24ee998356381b0a/System.Unicode.Build.Core/UnicodeDataProcessor.cs#L246

Note also that the non-Chinese simplified form of the radical can have an empty CJK radical character field when the radical character is not included in the Kangxi Radicals block or the CJK Radicals Supplement block, so the following line would also need to handle an empty character:

https://github.com/hexawyz/NetUnicodeInfo/blob/16ae6bc248cc10c02d3f200a24ee998356381b0a/System.Unicode.Build.Core/UnicodeDataProcessor.cs#L251
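To make the empty-field case concrete, here is a minimal Python sketch of a line parser that tolerates a missing radical character. It assumes the semicolon-separated layout `radical; radical code point; ideograph code point` with `#` starting a comment. `RadicalLine` and `parse_cjk_radicals_line` are hypothetical names, not part of NetUnicodeInfo, and the sample lines in the usage note are illustrative rather than copied from the data file.

```python
from typing import NamedTuple, Optional


class RadicalLine(NamedTuple):
    number: int                  # radical number, e.g. 211
    apostrophes: int             # 0, 1 or 2 trailing apostrophes
    radical_char: Optional[str]  # None when the radical field is empty
    ideograph: str               # the CJK unified ideograph


def parse_cjk_radicals_line(line: str) -> Optional[RadicalLine]:
    """Parse one data line; return None for blank or comment-only lines."""
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    fields = [field.strip() for field in content.split(";")]
    if len(fields) != 3:
        raise ValueError(f"Expected 3 fields, got {len(fields)}: {line!r}")
    radical, radical_cp, ideograph_cp = fields

    digits = radical.rstrip("'")
    apostrophes = len(radical) - len(digits)
    if not (digits.isascii() and digits.isdigit()) or apostrophes > 2:
        raise ValueError(f"Unrecognized radical field: {radical!r}")

    # An empty second field means there is no dedicated radical character.
    radical_char = chr(int(radical_cp, 16)) if radical_cp else None
    return RadicalLine(int(digits), apostrophes, radical_char, chr(int(ideograph_cp, 16)))
```

For example, `parse_cjk_radicals_line("211''; 2EED; 6B6F")` yields a populated `radical_char`, while a line with an empty middle field such as `"208''; ; 9F21"` yields `radical_char=None` instead of throwing.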

I'd be happy to add support for the non-Chinese simplified form. How would you prefer to represent an empty character on CjkRadicalData - as char?

hexawyz commented 3 months ago

Oh, that's great, another breaking update to the database 😅

From what I understand, what they call "non-Chinese" are actually Japanese characters. (The example they give is the Japanese kanji for tooth: 歯.) Before updating this, I'll do a quick sanity check that there is no weird stuff going on here, but the best solution would be to have "Chinese Simplified" and "Japanese Simplified" properties. (AFAIK, the PRC and Japan are the only two countries that have applied an official simplification process to Chinese characters, so hopefully there won't be an exception.)
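A rough sketch of what that two-property shape could look like, written in Python for brevity rather than as the library's C# type. `RadicalEntry` and `build_entries` are hypothetical names, and mapping the two-apostrophe form to a "Japanese" property is the assumption under discussion here, not something the Unicode data states.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class RadicalEntry:
    """One radical number with one optional ideograph per form (hypothetical type)."""
    number: int
    traditional: Optional[str] = None
    chinese_simplified: Optional[str] = None
    japanese_simplified: Optional[str] = None  # assumed meaning of the '' form


# Apostrophe count -> attribute that receives the ideograph.
_FIELD_BY_APOSTROPHES = {
    0: "traditional",
    1: "chinese_simplified",
    2: "japanese_simplified",
}


def build_entries(rows: Iterable[Tuple[int, int, str]]) -> Dict[int, RadicalEntry]:
    """Group (radical number, apostrophe count, ideograph) rows by radical number."""
    entries: Dict[int, RadicalEntry] = {}
    for number, apostrophes, ideograph in rows:
        entry = entries.setdefault(number, RadicalEntry(number))
        setattr(entry, _FIELD_BY_APOSTROPHES[apostrophes], ideograph)
    return entries
```

Fed the three tooth rows from the annex quote (齒, 齿, 歯 for radical 211), this produces a single entry with all three properties set; radicals without a simplified form simply leave those properties as `None`.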

hexawyz commented 3 months ago

So, I checked, and… For radical 182, I'm not sure where it comes from 🙁 For radical 208, it is indeed a Japanese kanji, but a lesser-used variant. (And also not a radical? The traditional one is still the official radical.) The others seem to be OK.

I don't really know what to make of it. It would seem that when the radical field is empty, it means that the character is an alternate (simplified) writing and not a proper radical, but that's a weird way to reference words here… 🤔