languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Does not work: Dump + Encode with frequency + Dump again #34

Open milekpl opened 10 years ago

milekpl commented 10 years ago

Jaume, I dumped the Polish dictionary, used the frequency list to encode it. But then I cannot dump the dictionary again as there is an error:

d:\download\LanguageTool-2.4-SNAPSHOT>java -cp languagetool.jar org.languagetool.dev.DictionaryExporter pl_PL.dict >pl_PL.src

```
Unhandled program error occurred.
Invoke with '--help' for help.
java.lang.RuntimeException: Invalid dictionary entry format (missing separator).
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:59)
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:15)
        at morfologik.tools.FSADumpTool.dump(FSADumpTool.java:171)
        at morfologik.tools.FSADumpTool.go(FSADumpTool.java:75)
        at morfologik.tools.Tool.go(Tool.java:45)
        at morfologik.tools.FSADumpTool.main(FSADumpTool.java:285)
        at org.languagetool.dev.DictionaryExporter.main(DictionaryExporter.java:40)
```

I think this is an omission on our part in morfologik-speller, but it also shows up in LT code.

jaumeortola commented 10 years ago

Hi, the DictionaryExporter in LT expects the speller dictionary to be inside a "hunspell" folder:

if (new File(filename).getAbsolutePath().contains("hunspell")) {
  FSADumpTool.main("--raw-data", "-d", args[0]);
} else {
  FSADumpTool.main("--raw-data", "-x", "-d", args[0]);
}

Taking the Polish dict from the hunspell folder, I can dump it. But I'm not sure if everything is OK.

milekpl commented 10 years ago

Jaume, I tried to dump the dictionary from the current folder, and that is when the error appears. I simply wanted to see if it was encoded properly (because I discovered an encoding-related bug).

I don't think hardcoding the folder helps, and -x should work for frequency dictionaries. Otherwise, we cannot say we supply the source, which violates Debian principles. This is why we have documented all decoding procedures, so that one could get the original sources. This means, however, that the decoding procedure has to produce readable frequency files, I'm afraid.

See also https://github.com/morfologik/morfologik-stemming/issues/15

danielnaber commented 9 years ago

Also see https://github.com/morfologik/morfologik-stemming/issues/35

danielnaber commented 9 years ago

So I understand that the problem is that we add the -x option depending on the hard-coded directory name. Instead, we need to look inside the .info file, check whether the fsa.dict.encoder option is set, and use the -x option only if that is the case. Is that correct?
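
A minimal sketch of that check, for illustration only. It assumes the .info file is a Java properties file that sits next to the .dict file and shares its base name. The class and method names (`DumpArgs`, `infoFileFor`, `hasEncoder`, `dumpArgs`) are hypothetical and are not part of LanguageTool or morfologik. The sketch only builds the argument list; it does not call FSADumpTool and does nothing about frequency data.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not taken from the LanguageTool code base.
class DumpArgs {

  /** Assumption: "foo.dict" has its metadata in "foo.info" in the same folder. */
  static Path infoFileFor(String dictFilename) {
    String base = dictFilename.endsWith(".dict")
        ? dictFilename.substring(0, dictFilename.length() - ".dict".length())
        : dictFilename;
    return Paths.get(base + ".info");
  }

  /** True if the .info file defines a non-empty fsa.dict.encoder property. */
  static boolean hasEncoder(Path infoFile) throws IOException {
    Properties props = new Properties();
    try (Reader reader = Files.newBufferedReader(infoFile, StandardCharsets.UTF_8)) {
      props.load(reader);
    }
    String encoder = props.getProperty("fsa.dict.encoder");
    return encoder != null && !encoder.trim().isEmpty();
  }

  /** Builds the dump arguments, adding -x only when an encoder is declared. */
  static List<String> dumpArgs(String dictFilename) throws IOException {
    List<String> args = new ArrayList<>();
    args.add("--raw-data");
    if (hasEncoder(infoFileFor(dictFilename))) {
      args.add("-x");
    }
    args.add("-d");
    args.add(dictFilename);
    return args;
  }

  public static void main(String[] args) throws IOException {
    // Prints the argument list that would replace the folder-name check.
    System.out.println(dumpArgs(args[0]));
  }
}
```

The resulting list could then be handed to the dump tool in place of the two hard-coded branches, so the decision depends on the dictionary's own metadata instead of its folder name.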

danielnaber commented 9 years ago

@milekpl Could you maybe help with this, i.e. reply to my question above from 2014-09-24?

milekpl commented 9 years ago

@danielnaber: it won't help. The encoder will be set but frequency dictionaries have more data. These data are not dumped properly. I tried to persuade Jaume to add code to dump frequency data but this is not a trivial thing to do, as the source format is XML.