Open russcam opened 4 months ago
Oh, that's great, another breaking update to the database 😅
From what I understand, what they call "non-Chinese" are actually Japanese characters. (The example they give is the Japanese kanji for tooth: 歯.) Before updating this, I'll do a quick sanity check that there is no weird stuff going on here, but the best solution would be to have "Chinese Simplified" and "Japanese Simplified" properties. (AFAIK, the PRC and Japan are the only two countries to have applied an official simplification process to Chinese characters, so hopefully there won't be an exception.)
So, I checked, and… For radical 182, I'm not sure where it comes from 🙁 For radical 208, it is indeed a Japanese kanji, but a lesser-used variant. (And also not a radical? The traditional one is still the official radical.) The others seem to be OK.
I don't really know what to make of it. It would seem that when the radical field is empty, it means that the character is an alternate (simplified) writing and not a proper radical, but that's a weird way to reference words here… 🤔
CJKRadicals-15.1.0.txt uses apostrophes after the radical number to indicate that the ideograph uses a standard simplification. From Unicode® Standard Annex #38 UNICODE HAN DATABASE (UNIHAN):
The ProcessCjkRadicalsFile method handles the single apostrophe case, but throws on the two apostrophe case at https://github.com/hexawyz/NetUnicodeInfo/blob/16ae6bc248cc10c02d3f200a24ee998356381b0a/System.Unicode.Build.Core/UnicodeDataProcessor.cs#L246
Note also that the non-Chinese simplified form of the radical can have an empty CJK radical character if the CJK radical character is not included in the Kangxi Radicals block or the CJK Radicals Supplement block, so the following would also need to handle an empty character
https://github.com/hexawyz/NetUnicodeInfo/blob/16ae6bc248cc10c02d3f200a24ee998356381b0a/System.Unicode.Build.Core/UnicodeDataProcessor.cs#L251
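To make the two cases above concrete, here is a rough, language-neutral sketch in Python (not the project's C# code) of parsing one line of the file while accepting zero, one, or two apostrophes and an empty radical field. It assumes the three-field `number; radical; ideograph` layout with `#` comments. The names `RadicalEntry` and `parse_radical_line` are made up for illustration, and the sample inputs in the usage note are illustrative lines written in that layout rather than text quoted from the file.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RadicalEntry:
    """Hypothetical record for one parsed line (not the library's CjkRadicalData)."""
    number: int                  # radical number without the apostrophes
    simplification: int          # trailing apostrophes: 0, 1, or 2
    radical_char: Optional[str]  # None when the radical field is empty
    ideograph: str               # the unified ideograph for the radical


def parse_radical_line(line: str) -> Optional[RadicalEntry]:
    """Parse 'number; radical; ideograph'. Returns None for blank/comment lines."""
    content = line.split("#", 1)[0].strip()
    if not content:
        return None

    fields = [field.strip() for field in content.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
    number_field, radical_field, ideograph_field = fields

    # Count the trailing apostrophes instead of assuming there is at most one.
    digits = number_field.rstrip("'")
    apostrophes = len(number_field) - len(digits)
    if not digits.isdigit() or apostrophes > 2:
        raise ValueError(f"unrecognized radical number: {number_field!r}")

    # An empty radical field is kept as None rather than treated as an error.
    radical_char = chr(int(radical_field, 16)) if radical_field else None
    ideograph = chr(int(ideograph_field, 16))
    return RadicalEntry(int(digits), apostrophes, radical_char, ideograph)
```

With this sketch, an input such as `211''; 2EED; 6B6F` yields `simplification == 2`, and an input with an empty middle field such as `182''; ; 98A8` yields `radical_char is None` instead of raising. Three or more apostrophes are still rejected so that an unexpected future format change fails loudly.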
I'd be happy to add support for the non-Chinese simplified form. How would you prefer to represent an empty character on CjkRadicalData - as char?