Fix tokenising when using more than just a-zA-Z

robotdana commented 5 years ago

Previously: Händler would be tokenized as ndler or ändler, depending on the Python version, rather than the expected händler.

Solution: use the regex module rather than re. This gives us the ability to use Unicode-aware character classes such as [[:upper:]] and [[:lower:]].
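A minimal sketch of the difference, not the actual scspell code: the function names `ascii_tokens` and `unicode_tokens` are hypothetical, the patterns are illustrative only, and the third-party `regex` package is assumed to be installed (`pip install regex`). The example assumes Python 3 and a precomposed ä.

```python
import re

try:
    import regex  # third-party package, assumed installed: pip install regex
except ImportError:
    regex = None


def ascii_tokens(text):
    """Tokenize with an ASCII-only class; non-ASCII letters split the word."""
    return re.findall(r"[a-zA-Z]+", text)


def unicode_tokens(text):
    """Tokenize with the POSIX classes that the regex module supports."""
    if regex is None:
        raise RuntimeError("the third-party 'regex' package is required")
    return regex.findall(r"(?:[[:upper:]]|[[:lower:]])+", text)


word = "H\u00e4ndler"  # "Händler" with a precomposed ä

print(ascii_tokens(word))  # ['H', 'ndler']
if regex is not None:
    print(unicode_tokens(word))  # ['Händler']
```

With the ASCII-only class the ä acts as a separator, which is consistent with the stray ndler token described above.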

Fixes #35

I'm usually a Ruby developer, not a Python developer. I don't know how to get the regex library working on 2.7, or how to compare the test strings in a Unicode-aware way (they're different on my Mac vs. on Travis: if one passes, the other fails).

But it mostly works
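On the string comparison: one possible cause of the Mac vs. Travis difference, which is an assumption and not something confirmed in this thread, is that the same text is stored in different Unicode normalization forms (precomposed ä vs. a followed by a combining diaeresis). A sketch of a normalization-insensitive comparison using the standard library, with the hypothetical helper name `same_text`:

```python
import unicodedata


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


composed = "H\u00e4ndler"     # ä as a single code point (NFC)
decomposed = "Ha\u0308ndler"  # a + combining diaeresis (NFD)

print(composed == decomposed)           # False
print(same_text(composed, decomposed))  # True
```

If normalization is the cause, normalizing both the expected and the actual strings before asserting would make the test independent of the platform.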

myint commented 5 years ago

Thanks! I haven't tried the regex module before. I'll take a look when I have more time.

robotdana commented 5 years ago

If you're interested, I took the really long way round to fixing this by creating my own spell checker: https://github.com/robotdana/spellr

myint / scspell

Fix tokenising when using more than just a-zA-Z #37