languagetool-org / languagetool

Style and Grammar Checker for 25+ Languages
https://languagetool.org
GNU Lesser General Public License v2.1

Decompiler no longer works for en-US & en-GB #10708

Open milekpl opened 2 weeks ago

milekpl commented 2 weeks ago

The documentation at

https://dev.languagetool.org/hunspell-support

is outdated: it does not mention that the English morfologik dictionaries are now kept in a separate jar, english-pos-dict.jar (for a reason that is unclear to me, given how small these files are). However, decompiling the files extracted from that jar fails as well:

    An unhandled exception occurred. Stack trace below.
    java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Unknown Source)
        at java.nio.HeapByteBuffer.put(Unknown Source)
        at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:86)
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:86)
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:12)
        at morfologik.tools.DictDecompile.call(DictDecompile.java:80)
        at morfologik.tools.DictDecompile.call(DictDecompile.java:20)
        at morfologik.tools.CliTool.main(CliTool.java:133)
        at morfologik.tools.DictDecompile.main(DictDecompile.java:132)
        at org.languagetool.tools.DictionaryExporter.build(DictionaryExporter.java:82)
        at org.languagetool.tools.DictionaryExporter.main(DictionaryExporter.java:59)
    Done. The dictionary export has been written to en-US.txt

I did not dig any deeper, but the Polish dictionaries decompile fine. Any ideas, @jaumeortola?

jaumeortola commented 2 weeks ago

Hi @milekpl, we prefer to put dictionaries in external dependencies because, even though the files are small (<1 MB, although some are larger), every update adds a substantial amount of data to the git repo.

When you export binary spelling dictionaries, make sure that the path contains "hunspell" or "spelling". See: https://github.com/languagetool-org/languagetool/blob/2446a07a9af6a69867f0c5ee3e0222458f508b86/languagetool-tools/src/main/java/org/languagetool/tools/DictionaryExporter.java#L68 We use that to distinguish spelling dictionaries from tagger or synthesizer dictionaries. I know that this is confusing. If we remove it, we'll need a new input parameter to specify the kind of dictionary, and we'll also need to modify all the scripts that use this class.
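To illustrate the path convention described above, here is a minimal sketch of such a check. It is not the actual code at the linked line: the class name, the enum, the method, and the example paths are all hypothetical, and treating the match as case-sensitive is an assumption.

```java
public class DictionaryKindSketch {

  /** Hypothetical enum for illustration only; not part of LanguageTool. */
  enum Kind { SPELLING, TAGGER_OR_SYNTHESIZER }

  /**
   * Guesses the dictionary kind from its path alone: a path containing
   * "hunspell" or "spelling" is treated as a spelling dictionary.
   * Assumption: the match is a plain, case-sensitive substring test.
   */
  static Kind guessKind(String dictionaryPath) {
    if (dictionaryPath.contains("hunspell") || dictionaryPath.contains("spelling")) {
      return Kind.SPELLING;
    }
    return Kind.TAGGER_OR_SYNTHESIZER;
  }

  public static void main(String[] args) {
    // Made-up example paths, chosen only to exercise both branches.
    System.out.println(guessKind("/tmp/export/hunspell/en_US.dict")); // SPELLING
    System.out.println(guessKind("/tmp/export/english.dict"));        // TAGGER_OR_SYNTHESIZER
  }
}
```

With a check of this shape, a spelling dictionary extracted to a directory whose path lacks both substrings is handled as a tagger or synthesizer dictionary, which would fit the failure reported above.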

milekpl commented 2 weeks ago

Hi @jaumeortola, thanks for the explanation. Indeed, it does work when the dictionary is stored under a hunspell directory.

Right now I have no time to work on this, but it seems to me it would be much easier just to use the existing logic of LT and require the user to provide the language code and the explicit flag -spell. Tagging and synthesis should work the same way as before. LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc. Alternatively, provide -i with a full path and the explicit flag (-spell).
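A rough sketch of how such a command line could be parsed, purely to make the proposal concrete. Only -i and -spell come from this discussion; the -lang option name, the class, and its fields are assumptions, and the step that resolves a language code to a resource path inside LT is deliberately left out.

```java
public class ExportOptionsSketch {

  /** Hypothetical holder for the parsed options; not an existing LT class. */
  static final class Options {
    final String languageCode; // e.g. "en-US"; null when -i is used
    final String inputPath;    // full path to a dictionary; null when -lang is used
    final boolean spelling;    // true when -spell was given

    Options(String languageCode, String inputPath, boolean spelling) {
      this.languageCode = languageCode;
      this.inputPath = inputPath;
      this.spelling = spelling;
    }
  }

  static Options parse(String[] args) {
    String lang = null;
    String input = null;
    boolean spell = false;
    for (int i = 0; i < args.length; i++) {
      switch (args[i]) {
        case "-spell": spell = true; break;
        case "-lang": lang = requireValue(args, ++i, "-lang"); break; // assumed option name
        case "-i": input = requireValue(args, ++i, "-i"); break;
        default: throw new IllegalArgumentException("Unknown option: " + args[i]);
      }
    }
    // Exactly one source must be given: a language code or an explicit path.
    if ((lang == null) == (input == null)) {
      throw new IllegalArgumentException("Provide exactly one of -lang or -i");
    }
    return new Options(lang, input, spell);
  }

  private static String requireValue(String[] args, int index, String flag) {
    if (index >= args.length) {
      throw new IllegalArgumentException("Missing value for " + flag);
    }
    return args[index];
  }

  public static void main(String[] args) {
    Options o = parse(new String[] {"-lang", "en-US", "-spell"});
    System.out.println(o.languageCode + " spelling=" + o.spelling); // en-US spelling=true
  }
}
```

The point of the explicit flag is that the kind of dictionary no longer depends on what the path happens to contain.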

jaumeortola commented 2 weeks ago

LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc.

We could do that, yes, keeping the current methods for backward compatibility.
Anyway, what is your goal with the English dictionary? Usually, developers decompile a binary dictionary when they want to update the dictionary and need to see the contents of the old dict.

milekpl commented 2 weeks ago

Ah, I needed a modern word list for English, and ours is nicely curated.

On Mon, 8 Jul 2024 at 08:52, Jaume Ortolà wrote:

LT is able to locate its resources, so we could simply instantiate a language and get the resource path this way, so that the user won't need to decompile a jar etc.

We could do that, yes, keeping the current methods for backward compatibility. Anyway, what is your goal with the English dictionary? Usually, developers decompile a binary dictionary when they want to update the dictionary and need to see the contents of the old dict.
