morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License

ArrayIndexOutOfBoundsException with replacement-pairs #34

Closed by danielnaber 9 years ago

danielnaber commented 9 years ago

This exception happens only with master, not with the latest release:

Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
    at morfologik.speller.HMatrix.get(HMatrix.java:81)
    at morfologik.speller.Speller.findRepl(Speller.java:484)
    at morfologik.speller.Speller.findRepl(Speller.java:525)
    at morfologik.speller.Speller.findReplacements(Speller.java:434)
    at org.languagetool.rules.spelling.morfologik.MorfologikSpeller.getSuggestions(MorfologikSpeller.java:90)
    at org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule.getRuleMatches(MorfologikSpellerRule.java:182)
    at org.languagetool.rules.spelling.morfologik.MorfologikSpellerRule.match(MorfologikSpellerRule.java:119)
    at org.languagetool.JLanguageTool.checkAnalyzedSentence(JLanguageTool.java:601)
    at org.languagetool.JLanguageTool$TextCheckCallable.call(JLanguageTool.java:937)

It happens when you request suggestions for the word "you" with the recently added Dutch dictionary of LanguageTool, whose .info file contains:

fsa.dict.speller.replacement-pairs=y ij

If you remove that line, it works fine. Let me know if you need more details to reproduce this.
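For readers unfamiliar with the setting, here is a minimal sketch of how such a line could be read into a lookup table. It does not use Morfologik's own parser; the class `ReplacementPairsSketch` and the method `parsePairs` are hypothetical, and the format (comma-separated pairs, each written as "from to") is an assumption inferred from the single pair shown above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ReplacementPairsSketch {

    // Hypothetical helper, not part of Morfologik: reads the replacement-pairs
    // property and groups the replacement strings by the string they replace.
    static Map<String, List<String>> parsePairs(String infoFileContent) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(infoFileContent));
        String raw = props.getProperty("fsa.dict.speller.replacement-pairs", "");

        Map<String, List<String>> pairs = new LinkedHashMap<>();
        // Assumption: pairs are separated by commas, and each pair is "from to".
        for (String pair : raw.split(",")) {
            String[] parts = pair.trim().split("\\s+");
            if (parts.length != 2) {
                continue; // skip empty or malformed entries
            }
            pairs.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return pairs;
    }

    public static void main(String[] args) throws IOException {
        // The single pair from the Dutch dictionary: "y" may be replaced by "ij".
        System.out.println(parsePairs("fsa.dict.speller.replacement-pairs=y ij\n"));
    }
}
```

With the Dutch setting, the table has one entry mapping "y" to "ij", which is why a word starting with "y" such as "you" exercises the replacement code path.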

dweiss commented 9 years ago

I'm not really that familiar with the speller -- Marcin or Jaume would have to take a look. Could it be a regression stemming from your previous pull request (GH-33)?

jaumeortola commented 9 years ago

The problem comes from my last commit (https://github.com/morfologik/morfologik-stemming/commit/675ce0555d50a41dc8dcc2818751306a05e11de5). It is also related to issue #30. I need to print out the H matrix to debug it.

danielnaber commented 9 years ago

@jaumeortola Is there a chance you're going to work on this soon? I'd like to make another attempt at switching the German spell checker to Morfologik, but that only makes sense if it provides good suggestions.

jaumeortola commented 9 years ago

It's tricky and I'm not sure it is really worthwhile. Before trying again, let me ask you a question. Have you tried using max_distance>1 without replacement pairs and with frequency data? Aren't the results good enough? Could you write a list of problem cases and the suggestions you would expect for them? Like Rytmus > Rhythmus, etc.

danielnaber commented 9 years ago

I was running some tests with max_distance > 1 and frequency data and I think I found another issue. Could you have a look at https://github.com/languagetool-org/languagetool/issues/236?

danielnaber commented 9 years ago

I tried Morfologik 1.9.0 with max_distance=2 and frequencies, and the results are often quite good. There are problems with short words like daß and muß, which should be corrected to dass and muss. The corrections are found, but they are ranked so low that they only appear in 14th position. I'll try to live with that.

This issue was originally about the ArrayIndexOutOfBoundsException and I suggest this should be fixed, either by rolling back the change that introduced it, or maybe by throwing an exception on startup if max_distance > 2 and replacement pairs are used.
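The second option could look roughly like the sketch below. It is only an illustration of the fail-fast idea: `SpellerConfigCheck` and `validate` are hypothetical names, not Morfologik API, and the limit of 2 is taken directly from the sentence above.

```java
public class SpellerConfigCheck {

    // Hypothetical startup check, not part of Morfologik: rejects the
    // combination described above instead of failing later during lookup.
    static void validate(int maxDistance, boolean hasReplacementPairs) {
        if (maxDistance > 2 && hasReplacementPairs) {
            throw new IllegalArgumentException(
                "max_distance=" + maxDistance
                + " is not supported together with replacement pairs");
        }
    }

    public static void main(String[] args) {
        validate(2, true); // accepted: within the assumed limit
        try {
            validate(3, true); // rejected at startup
        } catch (IllegalArgumentException e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}
```

The point of checking at startup is that a misconfigured dictionary is reported once with a clear message, rather than as an ArrayIndexOutOfBoundsException for individual words.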

jaumeortola commented 9 years ago

The out-of-bounds exception is solved now (https://github.com/morfologik/morfologik-stemming/pull/41). The results seem good even for max_distance>1 and replacement pairs. But probably there is still room for improvement.