morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological analysis. Includes a Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

Encoded sequences can clash with separator byte and cause assertion errors #85

Closed: danielnaber closed this issue 7 years ago

danielnaber commented 7 years ago

The following causes an assertion to fail at TrimSuffixEncoder.java:60:

import java.nio.file.Paths;
import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;

Dictionary d = Dictionary.read(Paths.get("/tmp/org/languagetool/resource/de/german_synth.dict"));
DictionaryLookup dict = new DictionaryLookup(d);
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT");       // works
dict.lookup("anfragen|DOESNOTEXIST");            // works
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT:NEB");   // AssertionError at TrimSuffixEncoder.java:60

To reproduce, you can get the german_synth.dict from http://search.maven.org/remotecontent?filepath=de/danielnaber/german-pos-dict/1.0/german-pos-dict-1.0.jar
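
For context, each lookup above returns a list of WordData entries; a minimal sketch of how a successful result is normally read (java.util.List and morfologik.stemming.WordData imports assumed):

List<WordData> forms = dict.lookup("anfragen|VER:1:PLU:KJ1:SFT");
for (WordData wd : forms) {
    // stem and tag of each matching dictionary entry
    System.out.println(wd.getStem() + " -> " + wd.getTag());
}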

The stack trace is:

java.lang.AssertionError
    at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:60)
    at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
    at org.languagetool.CrashTest.test(CrashTest.java:18)

In a less stripped-down case this shows up as follows (at least I assume it is the same issue):

Caused by: java.lang.IndexOutOfBoundsException
    at java.nio.Buffer.checkIndex(Buffer.java:540)
    at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
    at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:62)
    at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)

dweiss commented 7 years ago

I'll look into this, thanks Daniel.

dweiss commented 7 years ago

Confirmed. The encoded length of the suffix to strip can clash with the separator character. Thinking about how to fix this.

dweiss commented 7 years ago

Daniel, the simplest workaround for this problem at the moment is to recompile the dictionary from scratch and replace the separator character with something that has a lower byte value. Good candidates are ',' or '\t', since these fall below 'A' in the ASCII range, so the trim-length byte would have to wrap around (that is, encode a much longer suffix than ever occurs) before it could collide with them.
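
For reference, the separator is declared in the dictionary's .info metadata file that sits next to the .dict file; assuming the standard metadata keys, the change would look roughly like this (the encoding and encoder values shown are only illustrative and depend on how the dictionary was built):

# german_synth.info
fsa.dict.separator=,
fsa.dict.encoding=UTF-8
fsa.dict.encoder=SUFFIX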

I didn't invent this encoding scheme; it's a legacy of Janek Daciuk's work. A proper fix would involve some bookkeeping on the encoded string so that we don't accidentally trip over the separator character (the input itself must not contain it; this is verified).
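
To illustrate the clash, here is a simplified sketch of the idea behind the suffix-trim encoding (a hypothetical helper, not the actual TrimSuffixEncoder code): the encoded target starts with a single byte 'A' + k, where k is the number of bytes to strip from the end of the source word, and that length byte can coincide with the separator byte.

// Hypothetical sketch of the suffix-trim encoding (not the library implementation).
static byte[] encodeTrimSuffix(byte[] source, byte[] target, int sharedPrefix) {
    int trim = source.length - sharedPrefix;             // bytes to strip from the source
    byte[] encoded = new byte[1 + (target.length - sharedPrefix)];
    encoded[0] = (byte) ('A' + trim);                     // length byte: 'A', 'B', 'C', ...
    System.arraycopy(target, sharedPrefix, encoded, 1, target.length - sharedPrefix);
    // If the dictionary separator byte were, say, 'E', then trim == 4 would make the
    // length byte equal to the separator, and the decoder would split the entry too early.
    return encoded;
}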

dweiss commented 7 years ago

Fixed in 2.1.2. Will publish ASAP. Thanks for the detailed repro, Daniel!

danielnaber commented 7 years ago

Thanks for the fast fix. I can confirm it fixes the issue, even without rebuilding the dictionary. Or do you still recommend rebuilding it?

dweiss commented 7 years ago

The fix will work with your existing dictionary if you can, and are willing to, upgrade. If you want to stick with 2.1.1, you'll have to rebuild the dictionary with a different separator, unfortunately.
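
For anyone hitting this from a Maven build, the upgrade is just a version bump; a sketch assuming the usual org.carrot2 coordinates:

<dependency>
  <groupId>org.carrot2</groupId>
  <artifactId>morfologik-stemming</artifactId>
  <version>2.1.2</version>
</dependency>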