morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological analysis. Includes a Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Encoded sequences can clash with separator byte and cause assertion errors #85

Closed: danielnaber closed this issue 7 years ago

danielnaber commented 7 years ago

The following causes an assertion to fail at TrimSuffixEncoder.java:60:

import java.nio.file.Paths;
import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;

Dictionary d = Dictionary.read(Paths.get("/tmp/org/languagetool/resource/de/german_synth.dict"));
DictionaryLookup dict = new DictionaryLookup(d);
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT");       // works
dict.lookup("anfragen|DOESNOTEXIST");            // works
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT:NEB");   // AssertionError at TrimSuffixEncoder.java:60

To reproduce, you can get the german_synth.dict from http://search.maven.org/remotecontent?filepath=de/danielnaber/german-pos-dict/1.0/german-pos-dict-1.0.jar
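
For context, each lookup above returns a list of WordData entries; a minimal sketch of how a successful result is normally read (java.util.List and morfologik.stemming.WordData imports assumed):

List<WordData> forms = dict.lookup("anfragen|VER:1:PLU:KJ1:SFT");
for (WordData wd : forms) {
    // stem and tag of each matching dictionary entry
    System.out.println(wd.getStem() + " -> " + wd.getTag());
}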

The stack trace is:

java.lang.AssertionError
    at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:60)
    at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
    at org.languagetool.CrashTest.test(CrashTest.java:18)

In a less stripped-down case this shows up as follows (at least I assume it is the same issue):

Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:540)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
    at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:62)
    at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)

dweiss commented 7 years ago

I'll look into this, thanks Daniel.

dweiss commented 7 years ago

Confirmed. The encoded length of the suffix to strip can clash with the separator character. Thinking about how to fix this.

dweiss commented 7 years ago

Daniel, the simplest workaround for this problem at the moment is to recompile the dictionary from scratch and replace the separator character with something that has a lower byte value. Good candidates are ',' or '\t', since these fall below 'A' in the ASCII range, so the trim-length byte would have to wrap around (that is, encode a much longer suffix than ever occurs) before it could collide with them.
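
For reference, the separator is declared in the dictionary's .info metadata file that sits next to the .dict file; assuming the standard metadata keys, the change would look roughly like this (the encoding and encoder values shown are only illustrative and depend on how the dictionary was built):

# german_synth.info
fsa.dict.separator=,
fsa.dict.encoding=UTF-8
fsa.dict.encoder=SUFFIX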

I didn't invent this encoding scheme; it's a legacy of Janek Daciuk's work. A proper fix would involve some bookkeeping on the encoded string so that we don't accidentally trip over the separator character (the input itself must not contain it; this is verified).
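
To illustrate the clash, here is a simplified sketch of the idea behind the suffix-trim encoding (a hypothetical helper, not the actual TrimSuffixEncoder code): the encoded target starts with a single byte 'A' + k, where k is the number of bytes to strip from the end of the source word, and that length byte can coincide with the separator byte.

// Hypothetical sketch of the suffix-trim encoding (not the library implementation).
static byte[] encodeTrimSuffix(byte[] source, byte[] target, int sharedPrefix) {
    int trim = source.length - sharedPrefix;             // bytes to strip from the source
    byte[] encoded = new byte[1 + (target.length - sharedPrefix)];
    encoded[0] = (byte) ('A' + trim);                     // length byte: 'A', 'B', 'C', ...
    System.arraycopy(target, sharedPrefix, encoded, 1, target.length - sharedPrefix);
    // If the dictionary separator byte were, say, 'E', then trim == 4 would make the
    // length byte equal to the separator, and the decoder would split the entry too early.
    return encoded;
}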

dweiss commented 7 years ago

Fixed in 2.1.2. Will publish ASAP. Thanks for the detailed repro, Daniel!

danielnaber commented 7 years ago

Thanks for the fast fix. I can confirm it fixes the issue, even without rebuilding the dictionary. Or do you still recommend rebuilding it?

dweiss commented 7 years ago

The fix will work with your existing dictionary if you can, and are willing to, upgrade. If you want to stick with 2.1.1, you'll have to rebuild the dictionary with a different separator, unfortunately.
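
For anyone hitting this from a Maven build, the upgrade is just a version bump; a sketch assuming the usual org.carrot2 coordinates:

<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>morfologik-stemming</artifactId>
  <version>2.1.2</version>
</dependency>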