Open milekpl opened 10 years ago
Hi, the DictionaryExporter in LT expects the speller dictionary to be inside a "hunspell" folder:
if (new File(filename).getAbsolutePath().contains("hunspell")) {
    FSADumpTool.main("--raw-data", "-d", args[0]);
} else {
    FSADumpTool.main("--raw-data", "-x", "-d", args[0]);
}
Taking the Polish dictionary from the hunspell folder, I can dump it, but I'm not sure everything is OK.
Jaume, I tried to dump the dictionary from the current folder, and that is when the error appears. I simply wanted to see if it was encoded properly (because there is an encoding-related bug I discovered:
I don't think hardcoding the folder name helps, and -x should work for frequency dictionaries. Otherwise, we cannot say we supply the source, which violates Debian principles. This is why we have documented all decoding procedures, so that one can get the original sources. This means, however, that the decoding procedure has to produce readable frequency files, I'm afraid.
See also https://github.com/morfologik/morfologik-stemming/issues/15
So I understand that the problem is that we add the -x option depending on the hard-coded directory name. Instead we need to look inside the .info file and see if the fsa.dict.encoder option is set and only use the -x option if that is the case. Is that correct?
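To make the question concrete, here is a minimal sketch of that check, not the actual LT code. It assumes the .info file is a standard Java properties file, and it treats any non-blank fsa.dict.encoder value as "set". The class DumpArgs and both method names are hypothetical; only the flags --raw-data, -x and -d come from the snippet quoted at the top of this issue. As the reply below notes, this check alone may not be enough for frequency dictionaries.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: decides whether to pass -x based on the .info file
// instead of the directory name.
final class DumpArgs {

    // Returns true if the .info file defines a non-blank fsa.dict.encoder.
    // Assumption: the .info file uses the java.util.Properties format.
    static boolean hasEncoder(Path infoFile) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(infoFile, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        String encoder = props.getProperty("fsa.dict.encoder");
        return encoder != null && !encoder.trim().isEmpty();
    }

    // Builds the argument list for the dump tool; -x is added only when
    // an encoder is declared in the metadata.
    static String[] buildArgs(Path infoFile, String dictPath) throws IOException {
        List<String> args = new ArrayList<>();
        args.add("--raw-data");
        if (hasEncoder(infoFile)) {
            args.add("-x");
        }
        args.add("-d");
        args.add(dictPath);
        return args.toArray(new String[0]);
    }
}
```

The resulting array could then be passed to FSADumpTool.main in place of the two hard-coded calls.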
@milekpl Could you maybe help with this, i.e. reply to my question above from 2014-09-24?
@danielnaber: It won't help. The encoder will be set, but frequency dictionaries carry more data, and these data are not dumped properly. I tried to persuade Jaume to add code to dump frequency data, but this is not a trivial thing to do, as the source format is XML.
Jaume, I dumped the Polish dictionary and used the frequency list to encode it. But then I cannot dump the dictionary again, as there is an error: