Closed LucasDiogoDeon closed 4 years ago
Hi,
You should not look at implementation details such as _unicodeCharacterDataIndex. These are not public for a reason.
The correct way to determine whether a code point is assigned is to look at its Category. Unassigned code points will report a category of UnicodeCategory.OtherNotAssigned.
See https://unicode-browser.azurewebsites.net/codepoints/2065 (this is still running an older version of the lib, but it should be correct)
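For illustration, the category check described above could be wrapped in a small helper. This is only a sketch: IsAssigned and AssignedCheck are hypothetical names that are not part of the library, and the System.Unicode namespace is assumed to be the one the package exposes.

```csharp
using System;
using System.Globalization;
using System.Unicode; // assumption: namespace exposed by the library

static class AssignedCheck
{
    // Hypothetical helper: a code point is treated as assigned when its
    // general category is anything other than OtherNotAssigned (Cn).
    public static bool IsAssigned(int codePoint) =>
        UnicodeInfo.GetCharInfo(codePoint).Category != UnicodeCategory.OtherNotAssigned;

    static void Main()
    {
        Console.WriteLine(IsAssigned(0x0041)); // U+0041 LATIN CAPITAL LETTER A
        Console.WriteLine(IsAssigned(0x2065)); // U+2065, the code point linked above
    }
}
```

This relies only on the public Category property, so it does not depend on how the library stores its data internally.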
That worked!
I changed the validation.
I also forgot to include some categories while reading the UCD.zip file.
Enumerable.Range(0, 0x110000) // second argument is a count: covers U+0000 through U+10FFFF
    .Select(x => UnicodeInfo.GetCharInfo(x))
    .Count(x => x.Category != UnicodeCategory.OtherNotAssigned);
// returns 283440
Which makes sense according to https://www.unicode.org/versions/stats/charcountv13_0.html
Thanks a bunch.
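For reference, 283,440 is 2,048 more than the 281,392 listed as Total Assigned on that page. The gap matches the 2,048 surrogate code points (U+D800 through U+DFFF), which report the category Surrogate rather than OtherNotAssigned and which the page tallies separately. A minimal sketch that excludes them as well might look like the following; AssignedCount is a hypothetical name, the System.Unicode namespace is an assumption, and the printed number depends on the Unicode version bundled with the library.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Unicode; // assumption: namespace exposed by the library

static class AssignedCount
{
    static void Main()
    {
        // Walk every code point, U+0000 through U+10FFFF (0x110000 values).
        int total = Enumerable.Range(0, 0x110000)
            .Select(cp => UnicodeInfo.GetCharInfo(cp).Category)
            // Skip unassigned code points and the surrogate range.
            .Count(c => c != UnicodeCategory.OtherNotAssigned
                     && c != UnicodeCategory.Surrogate);

        Console.WriteLine(total);
    }
}
```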
The page Unicode Character Count V13.0 shows 281,392 characters in Total Assigned. https://www.unicode.org/versions/stats/charcountv13_0.html
However, there are 288,911 (7,519 more) UnicodeCharInfo values whose _unicodeCharacterDataIndex field is >= 0 (which I thought meant the character was valid).
The file https://www.unicode.org/Public/UCD/latest/ucd/UCD.zip shows the intervals of all valid Unicode characters.
I've created a program that lists all the invalid UnicodeCharInfo values: Demo.zip
Perhaps you could incorporate the list in your project.
Examples:
This is what I've done in my own project: