morfologik / morfologik-stemming

Tools for finite state automata construction and dictionary-based morphological dictionaries. Includes Polish stemming dictionary.
BSD 3-Clause "New" or "Revised" License
187 stars 44 forks source link

findReplacements() suboptimal results with replacement pairs #30

Closed danielnaber closed 9 years ago

danielnaber commented 10 years ago

I'm not sure if the algorithm is guaranteed to return the best replacements, but this one looks like a bug. It ranks "Mitmuss" better than "Rhythmus" when the misspelled word is "Rytmus".

public static void main(String[] args) throws Exception {
  File infoFile = new File("/tmp/morfologik.info");
  FileWriter fw1 = new FileWriter(infoFile);
  fw1.write("fsa.dict.separator=+\n");
  fw1.write("fsa.dict.encoding=utf-8\n");
  // without this, suggestions improve:
  fw1.write("fsa.dict.speller.replacement-pairs=s ss\n");
  fw1.close();

  File inputFile = new File("/tmp/morfologik.txt");
  FileWriter fw2 = new FileWriter(inputFile);
  fw2.write("Mitmuss\n");
  fw2.write("Rhythmus\n");
  fw2.close();

  File outputFile = new File("/tmp/morfologik.dict");
  String[] buildToolOptions =
          {"-i", inputFile.getAbsolutePath(), "-o", outputFile.getAbsolutePath()};
  FSABuildTool.main(buildToolOptions);

  Dictionary dictionary = Dictionary.read(outputFile);
  Speller speller = new Speller(dictionary, 2);
  List<String> replacements = speller.findReplacements("Rytmus");
  // -will print "[Mitmuss, Rhythmus]"
  // -will print "[Rhythmus]" if there are no replacement pairs
  // -Debugging shows that in the CandidateData constructor, 'Mitmuss' gets create with a distance of 0
  System.out.println("replacements: " + replacements);
}
jaumeortola commented 10 years ago

Using the replacement pair (s ss), the result is consistent. The distance Rytmus/Mitmuss is 2 (R/M, i/y). The distance Rytmus/Rhythmus is also 2 (two missing h). To improve the results you need to add other replacement pairs (for example: t th), or word frequencies. (The replacement pairs add no distance. A possible refinement would be to consider a .5 distance for the replacements.) You say that "Mitmuss gets a distance of 0". What is the distance of Rhythmus? It should be the same. If it is not the same, then there is a bug.

danielnaber commented 10 years ago

I see, but in my debugger I get these for the CandidateData constructor:

Mitmuss: distance=0, this.distance=25
Rhythmus: distance=2, this.distance=77
danielnaber commented 10 years ago

I think I found another example that's maybe clearer. Same code as above, but using t d as replacement pairs and Band and Wald as words in the dictionary. Searching for Walt, I get [Band, Wald].

jaumeortola commented 10 years ago

The truth is that the replacement pairs were never tested with max_distance>1, because it was tricky enough just with max_distance=1. So it is probably buggy, and I don't know if it is possible to make it work. The replacement pairs were, in fact, a way to avoid the costs of searching with max_distance>1. So I suggest using either max_distance=1 + replacement pairs or max_distance>1. Anyway I will try some debugging.

jaumeortola commented 10 years ago

I found the bug. That will solve some problems. I will make a pull request after some more testing. There remain some issues: pair replacement doesn't work when there is a previous deletion/insertion error (i. e. the lenghts of word and candidate are not the same as in Rytmus/Rhythmus).

danielnaber commented 9 years ago

Just trying the latest code from git and I see it's getting better, thanks. I think I found another issues (not sure if it's new to you): Same code as above, but using t d as replacement pairs and Band, Wald, and Waid as words in the dictionary. Searching for Walt, I get [Waid, Wald] with distance 1 (and [Waid, Wald, Band] with distance 2). Debugging shows that Waid has a distance of 0 in the CandidateData constructor.

danielnaber commented 9 years ago

Here's another test case for the same (?) issue: https://github.com/languagetool-org/languagetool/blob/master/languagetool-language-modules/de/src/test/java/org/languagetool/rules/de/GermanSpellerRuleTest.java#L297 (testMorfologikSpeller())

jaumeortola commented 9 years ago

The problem with ist/is/die is not the same issue. It is previous to the main changes I made in the Speller (Feb 2014). Probably it is related to this FIXME. The band around the diagonal of the Matrix needs to be wider, I think. Does it make sense, @milekpl?

dweiss commented 9 years ago

Seems to be fixed, closing. Reopen if needed.