jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins
GNU Affero General Public License v3.0

baseform: StackOverflowError in Dictionary.lookup #1

Closed dklotz closed 6 years ago

dklotz commented 9 years ago

With the baseform plugin from version 1.4.0.5 of the bundle (and ES 1.4.2), I still get StackOverflowErrors for some strings, like "ist" and "eine" (or longer strings containing them). Yes, these are typical stopwords, but since a lot of people will probably want to use the cutoff_frequency feature of ES instead of fixed lists of stopwords, this is still very relevant (apart from the fact that a token filter just should not throw an exception during normal use).

Following your feedback on the other issue I reported on this (on the baseforms plugin itself), I wrote a small script that goes through the de-lemma-utf8.txt file line by line and checks whether the left- and right-hand tokens are the same when compared case-insensitively. You can take a look at the "script" I used here: https://gist.github.com/dklotz/cf0906d0ff68d9578f8e
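For readers who cannot open the gist, the check described above can be sketched in Java roughly as follows. This is not the code from the gist. The class name `SelfMappingCheck` and the method `findSelfMappings` are made up for illustration, and the sketch assumes that every dictionary line holds exactly two tab-separated tokens, which the thread does not state.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the script from the gist.
final class SelfMappingCheck {

    // Returns every line whose two columns are equal when compared case-insensitively.
    // Assumption: a dictionary line consists of exactly two tab-separated tokens.
    static List<String> findSelfMappings(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split("\t");
            if (parts.length == 2 && parts[0].trim().equalsIgnoreCase(parts[1].trim())) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // The dictionary path is passed on the command line, e.g. de-lemma-utf8.txt
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        List<String> hits = findSelfMappings(lines);
        hits.forEach(System.out::println);
        System.out.println(hits.size() + " self-mapping lines");
    }
}
```

A check of this kind only sees entries that map a word to itself on a single line. It cannot see cycles that span several lines.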

Interestingly, that script finds 4937 lines where the two words are identical (apart from case), but "ist" and "eine" are NOT among the words found. So there is probably another error apart from the circular entries. Maybe the parsing logic should also be made robust enough that even circular entries would not produce an exception...
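One way such robustness could look is sketched below. It is not the plugin's actual `Dictionary` class. `CycleSafeDictionary`, its `add` method and its map-based storage are assumptions made for illustration. The idea is to follow the form-to-base-form links iteratively and remember every word already visited, so that a self-mapping or a longer cycle ends the lookup instead of exhausting the stack. A longer cycle (word A maps to B on one line, B maps back to A on another) would not be caught by a same-line comparison. Whether that is what happens for "ist" or "eine" is not confirmed in this thread.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the plugin's Dictionary; names and structure are assumptions.
final class CycleSafeDictionary {

    private final Map<String, String> baseForms = new HashMap<>();

    // Registers one dictionary entry: an inflected form and the form it maps to.
    void add(String form, String baseForm) {
        baseForms.put(form, baseForm);
    }

    // Follows form -> base form links until no further entry exists.
    // Stops as soon as a word repeats, so cyclic entries cannot recurse or loop forever.
    String lookup(String word) {
        Set<String> seen = new HashSet<>();
        String current = word;
        while (seen.add(current)) {          // add() returns false once a word repeats
            String next = baseForms.get(current);
            if (next == null) {
                return current;              // no further mapping: treat as the base form
            }
            current = next;
        }
        return current;                      // cycle detected: return the repeated word
    }
}
```

With this approach an entry that points to itself simply returns the word unchanged, and an unknown word is passed through as-is. Returning the repeated word on a cycle is one possible policy. Logging the offending entry or returning the original input would be equally valid choices.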

This is an example of the stack trace I'm seeing:

Exception in thread "main" org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
    at org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(AdapterActionFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:79)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:61)
    at com.fileee.search.impl.DefaultSearchClient.analyze(DefaultSearchClient.java:385)
    at com.fileee.search.impl.DefaultSearchClientTest.main(DefaultSearchClientTest.java:870)
Caused by: java.util.concurrent.ExecutionException: java.lang.StackOverflowError
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:288)
    at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:261)
    at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:92)
    at org.elasticsearch.action.support.AdapterActionFuture.actionGet(AdapterActionFuture.java:72)
    ... 3 more
Caused by: java.lang.StackOverflowError
    at java.nio.charset.CharsetDecoder.replaceWith(CharsetDecoder.java:303)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:207)
    at java.nio.charset.CharsetDecoder.<init>(CharsetDecoder.java:233)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:84)
    at sun.nio.cs.UTF_8$Decoder.<init>(UTF_8.java:81)
    at sun.nio.cs.UTF_8.newDecoder(UTF_8.java:68)
    at java.lang.StringCoding.decode(StringCoding.java:213)
    at java.lang.String.<init>(String.java:451)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:58)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
    at org.xbib.elasticsearch.index.analysis.baseform.Dictionary.lookup(Dictionary.java:59)
...

jprante commented 9 years ago

Thanks for reporting again, and for the effort of writing a program!

I had some time to clean up the dictionary file, see the result in https://github.com/jprante/elasticsearch-plugin-bundle/commit/68979818db8867c46bdd33b7ed46946364ee1b77

For all the input word forms on the left, the lookup algorithm terminates now.

The version I pushed out with the fix is 1.4.0.6.

dklotz commented 6 years ago

Closing old issues to clean up my open issue list. Probably fixed; feel free to reopen if someone else has this problem.