unicode-rs / unicode-normalization

Unicode Normalization forms according to UAX#15 rules
https://unicode-rs.github.io/unicode-normalization
Other
158 stars 40 forks source link

Assertion failure *value.borrow() <= MAX_NONSTARTERS + 2 #76

Closed Manishearth closed 3 years ago

Manishearth commented 3 years ago

Found by oss-fuzz in: https://github.com/unicode-rs/unicode-normalization/blob/master/fuzz/fuzz_targets/streaming.rs

Testcase: 𑍌ؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؙؔٙ

\u{11347}\u{11357}\u{00614}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00659}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}\u{00619}

Manishearth commented 3 years ago

cc @sunfishcode I'm trying to figure out what's wrong here but I'm not sure I've gotten it yet.

Manishearth commented 3 years ago

I suspect the fuzz test is incorrect: It's testing an invariant that only makes sense for strings that have an actual base character in them.

sunfishcode commented 3 years ago

\u{11347} and \u{11357} are both starters (canonical combining class of 0), but they also compose with each other (into \u{1134c}), which is what makes this case different from common cases. So it does appear to be a bug in the fuzz test. I'll look into whether the fuzz test can be fixed.

If I can't find a way to fix it, plan b is to replace the fuzz test with one which does this:

    let mut count = 0;
    for c in stream_safe.chars().stream_safe().nfkd() {
        if canonical_combining_class(c) != 0 {
            count += 1;
            assert!(count <= MAX_NONSTARTERS);
        } else {
            count = 0;
        }
    }

That doesn't test for the streaming property I originally wrote the fuzzer for, but does test that Stream-Safe basically works, without relying on any tricky details about how the inner iterators work.

sunfishcode commented 3 years ago

I've now opened #77, which fixes the fuzz test to not over-count when there are multiple starters.