hfst / hfst-ospell

HFST spell checker library and command line tool
Apache License 2.0

Configure option to default hfst-ospell to only N suggestions #16

Open hfst-importer opened 9 years ago

hfst-importer commented 9 years ago

There is a considerable speed difference between speller runs on the same text depending on whether hfst-ospell is allowed to give all suggestions or just a few:

tf-hsl-m0020:hfst smo036$ time preprocess test.txt | hfst-ospell -S tools/spellcheckers/fstbased/hfst/kl.zhfst > test.res.hsp-all.txt

real    0m40.156s
user    0m40.152s
sys     0m0.046s

tf-hsl-m0020:hfst smo036$ time preprocess test.txt | hfst-ospell -S -n5 tools/spellcheckers/fstbased/hfst/kl.zhfst > test.res.hsp-5.txt

real    0m10.123s
user    0m10.132s
sys     0m0.039s

tf-hsl-m0020:hfst smo036$ time preprocess test.txt | hfst-ospell -S -n10 tools/spellcheckers/fstbased/hfst/kl.zhfst > test.res.hsp-10.txt

real    0m11.897s
user    0m11.897s
sys     0m0.043s

At the same time, voikkospell (which always gives at most 5 suggestions) is markedly slower than hfst-ospell limited to 5 or 10 suggestions:

$ time preprocess test.txt | voikkospell -s -d kl -p tools/spellcheckers/fstbased/hfst/ > test.res.vk.txt 

real    0m16.588s
user    0m16.334s
sys     0m0.305s

I don't know the details of libvoikko's interactions with libhfstospell, but since hfst-ospell has no built-in configure-time or compile-time option to limit the number of suggestions, could it be that hfst-ospell is generating a lot of suggestions in the background that are never used? Please note that voikkospell would report fewer "misspellings", since it handles upper/lower casing automatically, whereas hfst-ospell (at least with the tested fst) only accepts lexical case. This difference might be one explanation for voikko being faster than the all-suggestions call to hfst-ospell (but it is still about 1.6 times slower than the corresponding hfst-ospell run with only 5 suggestions: 16.6 s vs. 10.1 s).

In any case I believe that being able to set a default number of suggestions at compile time is an easy way to ensure that hfst-ospell is not slower than needed.

Reported by: snomos

hfst-importer commented 9 years ago

IIRC the voikko interface predates the limit options, and especially a limit implementation that provides speed gains. A good upgrade should perhaps use voikko options to determine the maximum number of suggestions at run time (possibly in addition to this static configure-time maximum). That would give users of the various interfaces the option to tune it themselves, although the defaults in most implementations, such as office suites and enchant, are probably capped at 5–8 by now?
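To illustrate how a limit can give speed gains rather than just truncating the output, here is a generic sketch of a bounded n-best list. It is not hfst-ospell's actual search code; the class `BoundedSuggestions` and its methods are hypothetical names, and it assumes lower weights are better. Once the list holds N entries, the worst kept weight acts as a cutoff, so a search could skip any path that is already heavier than that.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// (weight, word); assumption: a lower weight means a better suggestion.
using Suggestion = std::pair<double, std::string>;

// Hypothetical helper: keeps at most `limit` best suggestions (0 = unlimited).
class BoundedSuggestions {
public:
    explicit BoundedSuggestions(std::size_t limit) : limit_(limit) {}

    // True if a candidate with this weight could still enter the list.
    // A search loop could call this to prune paths early.
    bool worth_exploring(double weight) const {
        return limit_ == 0 || heap_.size() < limit_ ||
               weight < heap_.top().first;
    }

    void offer(double weight, const std::string& word) {
        if (!worth_exploring(weight)) return;
        heap_.emplace(weight, word);
        // Evict the current worst entry if the list grew past the limit.
        if (limit_ != 0 && heap_.size() > limit_) heap_.pop();
    }

    // Kept suggestions, best (lowest weight) first.
    std::vector<Suggestion> sorted() const {
        auto copy = heap_;
        std::vector<Suggestion> out;
        while (!copy.empty()) {
            out.push_back(copy.top());
            copy.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }

private:
    std::size_t limit_;
    std::priority_queue<Suggestion> heap_;  // max-heap: worst entry on top
};
```

With a limit of 5, every candidate heavier than the fifth-best one found so far can be dropped without further work, which is one plausible reason the -n5 run above is so much faster than the unlimited one.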

Original comment by: flammie

hfst-importer commented 9 years ago

It sounds like a good idea to use whatever voikko options there are. Different interfaces and apps behave differently: the Mac OS X system-wide speller does not impose any restrictions, so the number of suggestions depends on the underlying speller (Hunspell seems to produce potentially huge lists of suggestions). Voikko limits the suggestions to 5 in all contexts, whereas MS Word shows 5 suggestions when right-clicking, but up to 20 in the spelling and grammar dialog. For almost any user, more than 5 suggestions make no sense: it becomes too hard to spot the correct one, or it takes too much time.

My idea was to use the static limit only as a default, letting the outside caller (Voikko, hfst-ospell command line tool, any other host app) set the actual limit via overrides.
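A minimal sketch of that idea, assuming C++17. The macro `OSPELL_DEFAULT_MAX_SUGGESTIONS` and the function `effective_limit` are made-up names for illustration, not existing hfst-ospell symbols; a configure option could pass the macro to the compiler with a `-D` flag.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical compile-time default, e.g. set by the build system with
// -DOSPELL_DEFAULT_MAX_SUGGESTIONS=5. Assumption: 0 means "no limit".
#ifndef OSPELL_DEFAULT_MAX_SUGGESTIONS
#define OSPELL_DEFAULT_MAX_SUGGESTIONS 5
#endif

// Resolve the limit actually used for a lookup: the caller's override
// (Voikko, the command line tool, any other host app) wins when present,
// otherwise the static default applies.
constexpr std::size_t effective_limit(
        std::optional<std::size_t> caller_override) {
    return caller_override.value_or(OSPELL_DEFAULT_MAX_SUGGESTIONS);
}
```

A caller that sets nothing would get `effective_limit(std::nullopt)`, i.e. the built-in default, while something like the command line's -n10 would map to `effective_limit(10)`.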

Original comment by: snomos