mikahama / natas

Python 3 library for processing historical English
Apache License 2.0

is_correctly_spelled() is overly generous #4

Closed peeter-t2 closed 3 years ago

peeter-t2 commented 3 years ago

Hey, I would like to use is_correctly_spelled to detect OCR errors, but it seems overly generous.

image

I guess the pages it finds are https://en.wiktionary.org/wiki/HATO and https://en.wiktionary.org/wiki/Teri. It did correctly reject 'itsejf'.

I wonder if it could make a more conservative match, and only offer lowercase options as real words? Perhaps optionally?

Thanks!

mikahama commented 3 years ago

Hi, Wiktionary does have a lot of weird words, which can be an issue. In the papers, we used lemmas from the Oxford English Dictionary, but the same problem is still present... There are a bunch of words in the OED that are rare or archaic.

If you want, you can pass a custom word list to the method: is_correctly_spelled(word, dictionary=["word1","word2"]). Also, if you wanted to try out your own experiments with Wiktionary, the code I used to get the lemmas is located in https://github.com/mikahama/natas/blob/master/natas/get_wiktionary_lemmas.py . By default, the method uses the dictionary stored in https://github.com/mikahama/natas/blob/master/natas/wiktionary_lemmas.json
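As a rough sketch of the custom word list route, one could filter a lemma list down to all-lowercase entries before passing it as `dictionary`. The helper name `conservative_lemmas` and the sample list are hypothetical, made up for illustration; only the `is_correctly_spelled(word, dictionary=...)` call comes from the comment above, and it is left commented out so the snippet runs without natas installed.

```python
def conservative_lemmas(lemmas):
    """Keep only entries written entirely in lowercase.

    str.islower() is True when the string has at least one cased
    character and no uppercase ones, so "HATO" and "Teri" are dropped
    while hyphenated entries such as "co-op" are kept.
    """
    return sorted({w for w in lemmas if w.islower()})


# Hypothetical sample standing in for a real lemma list.
sample_lemmas = ["itself", "HATO", "Teri", "house", "co-op"]
word_list = conservative_lemmas(sample_lemmas)
print(word_list)

# With natas installed, the filtered list can be passed as described above:
# import natas
# natas.is_correctly_spelled("itsejf", dictionary=word_list)
```

Filtering on case alone is a blunt heuristic: it also removes legitimate proper nouns, which may or may not be what you want for OCR output.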

Currently, the method lowercases the input word. Making lowercasing optional is a great idea and I will look into adding that feature.
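To illustrate what optional lowercasing could look like, here is a minimal standalone sketch. It is not the natas implementation: `is_known_word`, its `lowercase` parameter, and the sample word list are all hypothetical, and it assumes the dictionary entries are themselves stored in lowercase.

```python
def is_known_word(word, dictionary, lowercase=True):
    """Hypothetical stand-in for a dictionary lookup, not natas code.

    With lowercase=True the input is lowercased before the lookup
    (the behaviour described above); with lowercase=False the word
    must match a dictionary entry exactly, including case.
    """
    candidate = word.lower() if lowercase else word
    return candidate in set(dictionary)


# Hypothetical word list, assumed to be stored in lowercase.
words = ["itself", "hato", "teri"]

print(is_known_word("HATO", words))                   # lowercased first
print(is_known_word("HATO", words, lowercase=False))  # exact match only
print(is_known_word("itsejf", words))
```

Under these assumptions, an all-caps OCR token is accepted only when lowercasing is enabled, which is the behaviour the optional flag would let a caller switch off.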

mikahama commented 3 years ago

There is a new dictionary in the latest version.