hfst / hfst-ospell

HFST spell checker library and command line tool
Apache License 2.0

Mixed-case strings accepted by hfst-ospell-office #46

Closed: snomos closed this issue 3 years ago

snomos commented 5 years ago

The string DavveVássján is accepted by hfst-ospell-office with the smj speller (attached; rename the suffix to zhfst), although it should not be. hfst-ospell does not accept it:

echo DavveVássján | hfst-ospell -S smj.zhfst | head -n 10
"DavveVássján" is NOT in the lexicon:
Corrections for "DavveVássján":
Davve-Vássján    27.590923 <== this is the intended correction
Davve-Vássjá    37.590923
...

I assume it is a bug in the case-handling algorithm. For cases like these, the input string should be accepted if and only if it is accepted exactly as given, or with the initial letter downcased, or all upper. Crucially, it should not be accepted if it is accepted only when fully lowercased. In the example above, DavveVássján is not an acceptable word, although davvevássján is. Since neither DavveVássján nor davveVássján is accepted, the input string should be rejected, even though davvevássján is accepted.
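The proposed rule can be sketched roughly as follows. This is only an illustration in Python, not the hfst-ospell implementation: the names `case_candidates` and `accepts` are hypothetical, the lexicon is modelled as a plain set of strings rather than a transducer, and the handling of all-upper input (trying the lowercased and initial-capital forms) is one reading of "or all upper" above.

```python
def case_candidates(word):
    """Forms to look up for `word` under the proposed rule (hypothetical helper)."""
    candidates = [word]  # always try the string exactly as given
    if word[:1].isupper():
        # Try the form with only the initial letter downcased.
        candidates.append(word[0].lower() + word[1:])
    if word.isupper():
        # Assumption: all-upper input may also match the lowercased
        # or the initial-capital lexical form.
        candidates.append(word.lower())
        candidates.append(word[0] + word[1:].lower())
    return candidates


def accepts(word, lexicon):
    """True if any permitted case variant of `word` is in `lexicon` (a set)."""
    return any(candidate in lexicon for candidate in case_candidates(word))


# Toy lexicon standing in for the smj speller.
lexicon = {"davvevássján", "Davve-Vássján"}
print(accepts("DavveVássján", lexicon))   # mixed case, no exact match
print(accepts("Davvevássján", lexicon))   # initial upper of a lexicon word
```

The point of the sketch is that a fully lowercased lookup is never attempted for mixed-case input, so DavveVássján is rejected even though davvevássján is in the lexicon.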

smj.zip

TinoDidriksen commented 5 years ago

A repeat of https://github.com/hfst/hfst-ospell/issues/28#issuecomment-423116546

My rationale is that I don't want things like DaVvEvÁsSjÁn criticized, because it's highly likely the writer did this intentionally. Mixed caps is such a rare true positive.

I would initially argue this belongs in a higher level module with error category grading. But I can see the counter-rationale when the correction includes a non-letter.

snomos commented 5 years ago

Here are some SMJ examples of mixed case input that are accepted by the speller but should not be (data in two columns: misspelling TAB correction):

OarjjeVuodna         Oarjjevuodna
DavveVássján         Davve-Vássján
SisVássjá            Sis-Vássjá
NuorttaSálton        Nuortta-Sálton
NuorttaVuonan        Nuortta-Vuonan
GiellaGálldo         Giellagálldo
DoajmmaSiebrre       Doajmmasiebrre
AlmasjRiektá         Almasjriektá
Luhták-Áhkko         Luhták-áhkko
ÅNa-duodastus        ÅN-duodastus
ANa                  AN:a
NuorttaSállto        Nuortta-Sállto
ANa                  AN:a
NuorttaVuonan        Nuortta-Vuonan
NuorttaSálton        Nuortta-Sálton
Nuortta-Vuonarahtes  Nuortta-Vuodna-rahtes
ÅNa                  AN
HellmoCup            Hellmocup

This is just to point out that several kinds of misspelling lead to mixed-case strings which the speller presently cannot detect. In my opinion the speller should flag anything that is not one of: all lower case (except proper nouns, acronyms, etc.), initial upper case, all upper case, or mixed case that matches the lexical form exactly. Word has settings that let the user turn off spell checking of all-upper strings and of strings that mix numbers and letters. Beyond that, I think it is OK to flag everything that does not match the speller dictionary.
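To make the four case shapes concrete, here is a small Python sketch that classifies a string and flags mixed-case input unless it matches a lexicon entry exactly. The names `case_shape` and `flag_mixed` are hypothetical, the lexicon is again just a set of strings, and non-letters such as hyphens and colons are ignored when determining the shape, which is an assumption made for this illustration.

```python
def case_shape(word):
    """Classify `word` by the casing of its letters (non-letters are ignored)."""
    letters = [ch for ch in word if ch.isalpha()]
    if not letters:
        return "no-letters"
    if all(ch.islower() for ch in letters):
        return "lower"
    if all(ch.isupper() for ch in letters):
        return "upper"
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "initial-upper"
    return "mixed"


def flag_mixed(word, lexicon):
    """True if `word` is mixed case and not an exact lexicon entry.

    Other shapes are left to the ordinary case-folding lookup.
    """
    return case_shape(word) == "mixed" and word not in lexicon


# Toy lexicon with two lexical forms from the examples above.
lexicon = {"AN:a", "Davve-Vássján"}
for word in ["ANa", "AN:a", "DavveVássján", "Davve-Vássján", "HellmoCup"]:
    print(word, case_shape(word), flag_mixed(word, lexicon))
```

Under this sketch, AN:a and Davve-Vássján pass because they match the lexical form exactly, while ANa, DavveVássján and HellmoCup are flagged.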