roc-lang / unicode


Grapheme.split function crashes #19

Open Hasnep opened 2 weeks ago


The Grapheme.split function crashes on some edge cases. For example, running:

Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])

Crashes with the output:

The program crashed with:

        This is definitely a bug in the roc-lang/unicode package, caused by an unhandled edge case in grapheme text segmentation.

It is difficult to track down and catch every possible combination, so it would be helpful if you could log this as an issue with a reproduction.

Grapheme.split state machine state at the time was:
((AfterZWJ <opaque>), [8205, 4417], [ZWJ, L])

Here is the call stack that led to the crash:

        roc.panic
        Grapheme.splitHelp
        Grapheme.(anonymous function)
        Result.try
        Grapheme.split
        app.(anonymous function)
        Task.(anonymous function)
        .(anonymous function)
        rust.main

Optimizations can make this list inaccurate! If it looks wrong, try running without `--optimize` and with `--linker=legacy`

Here is a list of examples that crash this function:

Grapheme.split (Str.fromUtf8 [13, 204, 136, 225, 134, 168, 226, 128, 141, 234, 176, 129])
Grapheme.split (Str.fromUtf8 [224, 185, 131, 1, 225, 133, 160, 226, 128, 141, 224, 164, 128])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [225, 132, 128, 226, 128, 141, 204, 136, 31])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 204, 136, 205, 184])
Grapheme.split (Str.fromUtf8 [225, 133, 160, 226, 128, 141, 224, 164, 149])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 10])
Grapheme.split (Str.fromUtf8 [225, 134, 168, 226, 128, 141, 225, 133, 129])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 165, 141])
Grapheme.split (Str.fromUtf8 [234, 176, 128, 226, 128, 141, 224, 181, 142])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 204, 136, 240, 159, 135, 166])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 225, 134, 168])
Grapheme.split (Str.fromUtf8 [234, 176, 129, 226, 128, 141, 36])
Grapheme.split (Str.fromUtf8 [243, 160, 129, 174, 234, 176, 128, 226, 128, 141, 224, 164, 188])

They all contain U+200D, the zero-width joiner (ZWJ) character, and the state machine was in the AfterZWJ state when it crashed, so that's probably the source of the crash.
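To make the ZWJ observation easier to check, here is a small Python sketch (not part of the roc-lang/unicode package) that decodes a few of the byte sequences above into code points and reports what directly follows each U+200D. The helper names `code_points` and `after_zwj` are made up for this illustration, and only a subset of the reproductions is included.

```python
# Byte sequences copied from the reproductions in this issue (a subset).
CRASHING_INPUTS = [
    [225, 134, 168, 226, 128, 141, 225, 133, 129],
    [225, 134, 168, 226, 128, 141, 10],
    [234, 176, 129, 226, 128, 141, 36],
    [225, 133, 160, 226, 128, 141, 224, 164, 149],
]

ZWJ = 0x200D  # ZERO WIDTH JOINER


def code_points(utf8_bytes):
    """Decode a list of UTF-8 byte values into a list of code points."""
    return [ord(ch) for ch in bytes(utf8_bytes).decode("utf-8")]


def after_zwj(utf8_bytes):
    """Return the code points that directly follow a ZWJ in the input."""
    cps = code_points(utf8_bytes)
    return [nxt for cur, nxt in zip(cps, cps[1:]) if cur == ZWJ]


for sample in CRASHING_INPUTS:
    decoded = " ".join(f"U+{cp:04X}" for cp in code_points(sample))
    followers = [f"U+{cp:04X}" for cp in after_zwj(sample)]
    print(decoded, "-> after ZWJ:", followers)
```

For the first sample this gives U+11A8 U+200D U+1141, i.e. the 8205 and 4417 shown in the state dump above. In these four samples the code point before the ZWJ is a Hangul jamo or syllable, which might be worth checking against the other reproductions.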

These examples were found by running the radamsa fuzzer on the examples in the GraphemeBreakTest data file. Hopefully this fuzz testing can be automated in the future, as mentioned in #7.
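As a rough idea of how seed inputs for a fuzzer could be pulled out of the test data, here is a Python sketch that turns one line in the GraphemeBreakTest.txt format into UTF-8 bytes. This is an assumption about a possible workflow, not the procedure actually used for this issue; `seed_from_test_line` is a hypothetical helper, and the sample line is written in the file's format purely for illustration.

```python
def seed_from_test_line(line):
    """Turn one GraphemeBreakTest.txt-style line into UTF-8 seed bytes.

    Returns None for blank/comment-only lines and for lines containing
    surrogate code points, which cannot appear in valid UTF-8.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    # What remains is hex code points separated by break (÷) and
    # no-break (×) markers; only the code points are needed for a seed.
    cps = [int(tok, 16) for tok in data.split() if tok not in ("÷", "×")]
    if any(0xD800 <= cp <= 0xDFFF for cp in cps):
        return None
    return "".join(chr(cp) for cp in cps).encode("utf-8")


sample = "÷ 0020 × 0308 ÷ 0020 ÷\t#  ÷ [0.2] SPACE (Other) × [9.0] COMBINING DIAERESIS (Extend) ÷ [999.0] SPACE (Other) ÷ [0.3]"
print(list(seed_from_test_line(sample)))  # prints [32, 204, 136, 32]
```

Each resulting byte string could be written to its own file and handed to the fuzzer as a seed, with the mutated outputs then passed to Grapheme.split to look for crashes.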