hunspell / hunspell

The most popular spellchecking library.
http://hunspell.github.io/
GNU Lesser General Public License v2.1

want ability to store dictionaries in gzipped format #750

Open flachs opened 2 years ago

flachs commented 2 years ago

I want to be able to deploy hunspell in a small footprint. Most of the space hunspell requires goes to its dictionaries, and gzipping them saves significant space. hz isn't bad, but gz is ubiquitous and compresses a little better.

| file | orig | hz | gz |
|---|---:|---:|---:|
| a.dic | 10 | 27 | 33 |
| en_AU.aff | 27375 | 11314 | 5388 |
| en_AU.dic | 513822 | 204246 | 198465 |
| en_CA.aff | 1809 | 1153 | 498 |
| en_CA.dic | 698653 | 376870 | 326433 |
| en_GB.aff | 27449 | 11361 | 5488 |
| en_GB.dic | 527337 | 248352 | 243114 |
| en_NZ.aff | 27908 | 11492 | 5635 |
| en_NZ.dic | 536528 | 211648 | 207171 |
| en_US.aff | 3045 | 2565 | 991 |
| en_US.dic | 696131 | 253300 | 246482 |
| en_ZA.aff | 27449 | 11361 | 5488 |
| en_ZA.dic | 590143 | 246205 | 260975 |
| test.aff | 3037 | 2537 | 978 |
| test.dic | 696268 | 253536 | 246668 |
| total bytes | 4376964 | 1845967 | 1753807 |
| percentage of orig | 100.0% | 42.2% | 40.1% |
| percentage of hz | | 100.0% | 95.0% |

flachs commented 2 years ago

This patch does the trick for me.

This code implements a 'smart-open' sort of feature for dictionary files.

- ending with .hz => open with hzip
- ending with .gz => open with popen("gunzip -c")
- otherwise attempt to open as a file
  - if this fails, attempt to open .hz with hzip
  - if this fails, attempt to open .gz with popen("gunzip -c")

[patch.zip](https://github.com/hunspell/hunspell/files/7493659/patch.zip)

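
For readers who don't want to unpack the zip, here is a rough sketch of the lookup order described above. It is not the code from patch.zip. The names `Opener`, `OpenPlan` and `plan_open` are made up for illustration, and the `exists` callback is an assumed stand-in for "attempt to open the file". The sketch only decides which opener to use and on which path; the actual hzip or popen("gunzip -c") call is left out.

```cpp
#include <functional>
#include <string>

// Hypothetical names for illustration only; none of these exist in hunspell.
enum class Opener { Plain, Hzip, GunzipPipe, NotFound };

struct OpenPlan {
    Opener opener;     // how the file should be opened
    std::string path;  // the path that should actually be opened
};

// True if s ends with suffix.
static bool ends_with(const std::string& s, const std::string& suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Decide how to open a dictionary file, following the order in the comment:
// an explicit .hz or .gz suffix wins; otherwise try the plain file,
// then path + ".hz", then path + ".gz".
// `exists` is an assumption standing in for "the open attempt succeeded".
OpenPlan plan_open(const std::string& path,
                   const std::function<bool(const std::string&)>& exists) {
    if (ends_with(path, ".hz")) return {Opener::Hzip, path};
    if (ends_with(path, ".gz")) return {Opener::GunzipPipe, path};
    if (exists(path)) return {Opener::Plain, path};
    if (exists(path + ".hz")) return {Opener::Hzip, path + ".hz"};
    if (exists(path + ".gz")) return {Opener::GunzipPipe, path + ".gz"};
    return {Opener::NotFound, path};
}
```

With this shape, a caller asking for en_US.dic when only en_US.dic.gz is on disk would get back the gunzip opener and the .gz path, so existing callers would not need to know which format was deployed.
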
ssvb commented 1 year ago

It's strange that gzip doesn't provide better compression. Maybe try ppmd? It's very good specifically for text compression: https://github.com/svpv/ppmd-mini