Closed snomos closed 3 years ago
A repeat of https://github.com/hfst/hfst-ospell/issues/28#issuecomment-423116546
My rationale is that I don't want things like DaVvEvÁsSjÁn criticized, because it's highly likely the writer did this intentionally. Mixed caps is such a rare true positive.
I would initially argue this belongs in a higher level module with error category grading. But I can see the counter-rationale when the correction includes a non-letter.
Here are some SMJ examples of mixed case input that are accepted by the speller but should not be (data in two columns: misspelling TAB correction):
OarjjeVuodna Oarjjevuodna
DavveVássján Davve-Vássján
SisVássjá Sis-Vássjá
NuorttaSálton Nuortta-Sálton
NuorttaVuonan Nuortta-Vuonan
GiellaGálldo Giellagálldo
DoajmmaSiebrre Doajmmasiebrre
AlmasjRiektá Almasjriektá
Luhták-Áhkko Luhták-áhkko
ÅNa-duodastus ÅN-duodastus
ANa AN:a
NuorttaSállto Nuortta-Sállto
ANa AN:a
NuorttaVuonan Nuortta-Vuonan
NuorttaSálton Nuortta-Sálton
Nuortta-Vuonarahtes Nuortta-Vuodna-rahtes
ÅNa AN
HellmoCup Hellmocup
This is just to point out that various kinds of misspellings lead to mixed-case strings that the speller is presently unable to detect. In my opinion the speller should flag anything that is not one of: all lower case (except proper nouns, acronyms, etc.), initial upper case, all upper case, or mixed case that matches the lexical form exactly. Word has settings that let the user turn off spell checking of all-upper-case strings and of mixed number-letter strings. Beyond that, I think it is OK to flag everything that does not match the speller dictionary.
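To make the four case categories concrete, here is a minimal Python sketch of a case-shape classifier. The function name `case_shape` and the category labels are hypothetical and are not part of hfst-ospell or any existing speller. The sketch assumes that only alphabetic characters count, so hyphens and colons are ignored.

```python
def case_shape(token: str) -> str:
    """Classify a token by the case of its letters (hypothetical helper).

    Non-letters such as '-' and ':' are ignored (an assumption of this sketch).
    """
    letters = [ch for ch in token if ch.isalpha()]
    if not letters:
        return "none"      # no letters at all, e.g. "123"
    if all(ch.islower() for ch in letters):
        return "lower"     # davvevássján
    if all(ch.isupper() for ch in letters):
        return "upper"     # ÅN
    if letters[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "initial"   # Oarjjevuodna
    return "mixed"         # DavveVássján, but also Davve-Vássján and AN:a


for word in ["davvevássján", "Oarjjevuodna", "ÅN", "DavveVássján", "Davve-Vássján"]:
    print(word, case_shape(word))
```

Note that a correct form such as Davve-Vássján is also classified as mixed. The shape alone cannot decide whether to flag a word, which is why a mixed-case string would still have to be compared against the exact lexical form.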
The string DavveVássján is accepted by the smj speller (attached, rename suffix to zhfst), although it should not be. hfst-ospell does not accept it.
I assume it is a bug in the case handling algorithm. For cases like these, the input string should be accepted if and only if it is accepted exactly as given, or with the initial letter downcased, or all upper. Crucially, it should not be accepted if it is only accepted when all lowercased. In the example above, DavveVássján is not an acceptable word, although davvevássján is. But since neither DavveVássján nor davveVássján is accepted, the input string should be rejected, despite davvevássján being accepted.
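The acceptance rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions and not the hfst-ospell implementation: the function `accepts` and the toy `lexicon` set are hypothetical, the lexicon stands in for the real speller transducer, and "all upper" is read here as "an all-upper-case input may match any lexical form by upper-casing that form".

```python
def accepts(token: str, lexicon: set) -> bool:
    """Hypothetical case-handling check; `lexicon` is a toy stand-in for the speller."""
    # 1. Accepted exactly as given.
    if token in lexicon:
        return True
    # 2. Accepted with only the initial letter downcased (sentence-initial words).
    if token[:1].isupper() and (token[:1].lower() + token[1:]) in lexicon:
        return True
    # 3. All-upper input: accept if it is the upper-cased version of some entry
    #    (assumed interpretation of "or all upper").
    if token.isupper():
        return any(entry.upper() == token for entry in lexicon)
    # Deliberately no fallback that lowercases the whole token.
    return False


# Toy lexicon: the two forms the thread treats as correct.
lexicon = {"davvevássján", "Davve-Vássján"}

for word in ["davvevássján", "Davvevássján", "DAVVEVÁSSJÁN", "Davve-Vássján", "DavveVássján"]:
    print(word, accepts(word, lexicon))
```

With this toy lexicon, DavveVássján is rejected because neither DavveVássján nor davveVássján is an entry, while the lower-case, initial-upper, all-upper, and exact hyphenated forms are accepted. The linear scan in step 3 is for readability only; a real speller would do the case folding inside the lookup.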
smj.zip