7519 characters are not Unicode valid

LucasDiogoDeon commented 4 years ago

The page Unicode Character Count V13.0 shows 281,392 characters in Total Assigned. https://www.unicode.org/versions/stats/charcountv13_0.html

However, there are 288,911 (7,519 more) UnicodeCharInfo values where the _unicodeCharacterDataIndex field is >= 0 (which I thought meant the character was valid).

The file https://www.unicode.org/Public/UCD/latest/ucd/UCD.zip shows the intervals of all valid Unicode characters.

I've created a program that lists all the invalid UnicodeCharInfo values: Demo.zip
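
For readers who want to reproduce a count from the UCD files, here is a minimal sketch of one way to do it. It is not the contents of Demo.zip. It assumes the input is UnicodeData.txt from UCD.zip (semicolon-separated fields, code point first, name second) and that large ranges are written as a pair of lines whose names end in ", First>" and ", Last>". The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class UnicodeDataRanges
{
    // Counts the code points listed in UnicodeData.txt-style lines, expanding
    // each "<..., First>" / "<..., Last>" pair into the full range it stands for.
    public static int CountListedCodePoints(IEnumerable<string> lines)
    {
        int count = 0;
        int rangeStart = -1; // code point of a pending "First" line, or -1 if none

        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;

            string[] fields = line.Split(';');
            int codePoint = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            string name = fields[1];

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = codePoint;
            }
            else if (name.EndsWith(", Last>", StringComparison.Ordinal))
            {
                if (rangeStart < 0)
                    throw new FormatException("Range end without a matching range start.");
                count += codePoint - rangeStart + 1;
                rangeStart = -1;
            }
            else
            {
                count++; // ordinary single-code-point line
            }
        }

        return count;
    }

    public static void Main()
    {
        // Three sample lines in the UnicodeData.txt layout: one single entry and one range.
        string[] sample =
        {
            "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
            "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
            "4DBF;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
        };
        Console.WriteLine(CountListedCodePoints(sample));
    }
}
```

Skipping the range pairs, or filtering out some general categories while reading the file, are easy ways to end up with a total that disagrees with the published statistics.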

Perhaps you could incorporate the list in your project.

Examples:

https://www.fileformat.info/info/unicode/char/2065/index.htm
- UnicodeInfo.GetCharInfo(0x2065)
- _unicodeCharacterDataIndex = 7375
https://www.fileformat.info/info/unicode/char/10FFFE/index.htm
- UnicodeInfo.GetCharInfo(0x10FFFE)
- _unicodeCharacterDataIndex = 33840

This is what I've done in my own project:

public static IReadOnlyCollection<UnicodeCharInfo> All => _all.Value;
private static readonly Lazy<IReadOnlyCollection<UnicodeCharInfo>> _all
    = new Lazy<IReadOnlyCollection<UnicodeCharInfo>>(() =>
        _list // list of all valid Unicode characters (validUnicodeCharacters)
            .Select(x => UnicodeInfo.GetCharInfo(x))
            .ToList());

hexawyz commented 4 years ago

Hi,

You should not look at implementation details such as _unicodeCharacterDataIndex. These are not public for a reason.

The correct way to determine if a code point is assigned is to look at its Category. Unassigned code points will report a category of UnicodeCategory.OtherNotAssigned.

See https://unicode-browser.azurewebsites.net/codepoints/2065 (this is still running an older version of the lib, but it should be correct)
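
The category check described above can be sketched as a small helper. This is only an illustration: it uses UnicodeInfo.GetCharInfo and the Category property already mentioned in this thread, assumes the library is referenced and exposes them under the System.Unicode namespace, and the IsAssigned name is hypothetical.

```csharp
using System;
using System.Globalization;
using System.Unicode; // assumption: namespace exposed by the NetUnicodeInfo library

public static class AssignedCheck
{
    // A code point counts as assigned when its general category is anything
    // other than OtherNotAssigned (Cn). Relies only on the public Category property.
    public static bool IsAssigned(int codePoint) =>
        UnicodeInfo.GetCharInfo(codePoint).Category != UnicodeCategory.OtherNotAssigned;

    public static void Main()
    {
        // U+0041 is a letter; U+2065 and U+10FFFE are the two examples from the report.
        foreach (int cp in new[] { 0x0041, 0x2065, 0x10FFFE })
        {
            Console.WriteLine($"U+{cp:X4}: assigned = {IsAssigned(cp)}");
        }
    }
}
```

Going through the public Category property keeps the check independent of private fields such as _unicodeCharacterDataIndex, which may change between versions.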

LucasDiogoDeon commented 4 years ago

That worked!

I changed the validation.

I also forgot to include some categories while reading the UCD.zip file.

Enumerable.Range(0, 0x110000) // Range takes (start, count), so this covers U+0000 through U+10FFFF
    .Select(x => UnicodeInfo.GetCharInfo(x))
    .Count(x => x.Category != UnicodeCategory.OtherNotAssigned);
// returns 283440

Which makes sense according to https://www.unicode.org/versions/stats/charcountv13_0.html

Total Designated = 283,506
Noncharacters = 66
Total Designated - Noncharacters = 283,440 valid Unicode characters
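
The arithmetic above can be double-checked with a short, library-free sketch. The totals are the figures quoted in this thread. The noncharacter definition (U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes) and the surrogate range (U+D800 to U+DFFF) come from the Unicode Standard. The class and method names are hypothetical.

```csharp
using System;

public static class CharCountCheck
{
    // Counts noncharacters: U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane.
    public static int CountNoncharacters()
    {
        int count = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++)
        {
            bool inFdd0Range = cp >= 0xFDD0 && cp <= 0xFDEF;
            bool lastTwoOfPlane = (cp & 0xFFFE) == 0xFFFE;
            if (inFdd0Range || lastTwoOfPlane) count++;
        }
        return count;
    }

    public static void Main()
    {
        const int totalAssigned = 281_392;   // "Total Assigned" quoted in the first comment
        const int totalDesignated = 283_506; // "Total Designated" quoted above
        const int libraryCount = 283_440;    // count reported via the Category check
        const int surrogates = 0xDFFF - 0xD800 + 1;

        int noncharacters = CountNoncharacters();

        Console.WriteLine(noncharacters);                                   // 66
        Console.WriteLine(totalDesignated - noncharacters == libraryCount); // True
        Console.WriteLine(totalAssigned + surrogates == libraryCount);      // True
    }
}
```

The last line also accounts for the gap between the two totals quoted in the thread: 283,440 exceeds "Total Assigned" by exactly the 2,048 surrogate code points, which have their own general category rather than being reported as unassigned.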

Thanks a bunch.

hexawyz / NetUnicodeInfo

7519 characters are not Unicode valid #6