the-moisrex / webpp

C++ web framework | web development can be done with C++ as well.
https://t.me/webpp
MIT License
134 stars 9 forks source link

Unicode: normalization_form #559

Open the-moisrex opened 2 months ago

the-moisrex commented 2 months ago

A function that would check out a string (pair of iterators) to see in which normalization form the string is in.

The Unicode Character Database supplies properties that allow implementations to quickly determine whether a string x is in a particular Normalization Form—for example, isNFC(x). This is, in general, many times faster than normalizing and then comparing.

From UTS #15

public int quickCheck(String source) {
    short lastCanonicalClass = 0;
    int result = YES;
    for (int i = 0; i < source.length(); ++i) {
        int ch = source.codepointAt(i);
        if (Character.isSupplementaryCodePoint(ch)) ++i;
        short canonicalClass = getCanonicalClass(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return NO;        }
        int check = isAllowed(ch);
        if (check == NO) return NO;
        if (check == MAYBE) result = MAYBE;
        lastCanonicalClass = canonicalClass;
    }
    return result;
}

public static final int NO = 0, YES = 1, MAYBE = -1;