hexawyz / NetUnicodeInfo

Unicode Character Inspector & Library providing a subset of the Unicode data for .NET clients.
https://www.nuget.org/packages/UnicodeInformation/
MIT License

7519 characters are not Unicode valid #6

Closed LucasDiogoDeon closed 4 years ago

LucasDiogoDeon commented 4 years ago

The page Unicode Character Count V13.0 shows 281,392 characters in Total Assigned. https://www.unicode.org/versions/stats/charcountv13_0.html

However, there are 288,911 UnicodeCharInfo values (7,519 more) whose _unicodeCharacterDataIndex field is >= 0, which I thought meant the character was valid.

The file https://www.unicode.org/Public/UCD/latest/ucd/UCD.zip shows the intervals of all valid Unicode characters.

I've created a program that lists all the invalid UnicodeCharInfo values: Demo.zip

Perhaps you could incorporate the list in your project.

Examples:

This is what I've done in my own project:

public static IReadOnlyCollection<UnicodeCharInfo> All => _all.Value;
private static readonly Lazy<IReadOnlyCollection<UnicodeCharInfo>> _all
    = new Lazy<IReadOnlyCollection<UnicodeCharInfo>>(() =>
        _list // list of all valid Unicode characters (validUnicodeCharacters)
            .Select(x => UnicodeInfo.GetCharInfo(x))
            .ToList());

hexawyz commented 4 years ago

Hi,

You should not look at implementation details such as _unicodeCharacterDataIndex. These are not public for a reason.

The correct way to determine if a code point is assigned is to look at its Category. Unassigned code points will report a category of UnicodeCategory.OtherNotAssigned.
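A minimal sketch of that check, for illustration only. To keep it runnable without the NuGet package, it swaps in the BCL's `CharUnicodeInfo.GetUnicodeCategory(int)` (available in .NET Core 3.0 and later) for the library call `UnicodeInfo.GetCharInfo(codePoint).Category` used elsewhere in this thread. The `CodePointCheck` class and `IsAssigned` method names are hypothetical, and the Unicode data version of the BCL depends on the runtime, so results for recently added characters may differ from the library's.

```csharp
using System;
using System.Globalization;

public static class CodePointCheck
{
    // Hypothetical helper: true when the code point's general category is
    // anything other than Cn (OtherNotAssigned).
    // Note: surrogates (Cs) and private-use code points (Co) also return true.
    public static bool IsAssigned(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        // BCL stand-in for UnicodeInfo.GetCharInfo(codePoint).Category.
        return CharUnicodeInfo.GetUnicodeCategory(codePoint)
            != UnicodeCategory.OtherNotAssigned;
    }
}
```

With this sketch, `CodePointCheck.IsAssigned(0x41)` should be true for LATIN CAPITAL LETTER A, while `CodePointCheck.IsAssigned(0x2065)` should be false for the unassigned code point linked below.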

See https://unicode-browser.azurewebsites.net/codepoints/2065 (this is still running an older version of the lib, but it should be correct)

LucasDiogoDeon commented 4 years ago

That worked!

I changed the validation.

I also forgot to include some categories while reading the UCD.zip file.

Enumerable.Range(0, 0x110000) // second argument is a count, so this covers U+0000..U+10FFFF
    .Select(x => UnicodeInfo.GetCharInfo(x))
    .Where(x => x.Category != UnicodeCategory.OtherNotAssigned)
    .Count();
// returns 283440

Which makes sense according to https://www.unicode.org/versions/stats/charcountv13_0.html (281,392 assigned characters plus 2,048 surrogate code points).
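The gap between 283,440 and the 281,392 "Total Assigned" figure appears to be the 2,048 surrogate code points, whose category is Surrogate rather than OtherNotAssigned (281,392 + 2,048 = 283,440). A rough way to see this is to tally code points per category. The sketch below uses the BCL's `CharUnicodeInfo` rather than the library so it runs without extra packages. The `CategoryTally` class and its method names are hypothetical, and totals other than the surrogate and private-use counts depend on the Unicode version shipped with the runtime.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class CategoryTally
{
    // Counts code points per general category over U+0000..U+10FFFF.
    public static Dictionary<UnicodeCategory, int> CountByCategory()
    {
        return Enumerable.Range(0, 0x110000) // count, not upper bound
            .GroupBy(cp => CharUnicodeInfo.GetUnicodeCategory(cp))
            .ToDictionary(g => g.Key, g => g.Count());
    }

    // Returns (code points that are not Cn, surrogate code points).
    public static (int NotUnassigned, int Surrogates) Summarize()
    {
        var counts = CountByCategory();
        int notUnassigned = counts
            .Where(kv => kv.Key != UnicodeCategory.OtherNotAssigned)
            .Sum(kv => kv.Value);
        return (notUnassigned, counts[UnicodeCategory.Surrogate]);
    }
}
```

Subtracting `Surrogates` from `NotUnassigned` gives a figure comparable to the "Total Assigned" row of the statistics page for whichever Unicode version the runtime implements.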

Thanks a bunch.