rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Tokenize words such as "l'heure" the same way as "l'arc" #46

Closed rspeer closed 7 years ago

rspeer commented 7 years ago

The Unicode standard mentions a fiddly little rule about splitting between apostrophes and vowels, which Python's regex module faithfully implements. But for languages where this matters, such as French, it seems they forgot about splitting between an apostrophe and a silent "h".

This branch adds another tokenization path that fixes up words such as "l'heure".

It also keeps the apostrophe attached to the token when include_punctuation=True, as the name of that option suggests it should.
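
To illustrate the behavior described above, here is a minimal sketch of the idea, not the code in this branch. The names ELISION_RE and split_elision are hypothetical, and the sketch uses the standard library's re module rather than regex. It assumes that any single letter followed by an apostrophe at the start of a token is an elided article or pronoun, which is a simplification: it ignores the language, so it would also split a token like "o'clock".

```python
import re

# Hypothetical pattern: one letter, an apostrophe (straight or curly),
# then the rest of the word. [^\W\d_] matches a letter but not a digit
# or an underscore.
ELISION_RE = re.compile(r"^([^\W\d_]['\u2019])(\w.*)$")


def split_elision(token, include_punctuation=False):
    """Split a token such as "l'heure" into its elided prefix and the rest.

    Tokens that do not start with a single letter plus an apostrophe are
    returned unchanged, as a one-element list.
    """
    match = ELISION_RE.match(token)
    if match is None:
        return [token]
    prefix, rest = match.groups()
    if not include_punctuation:
        # Drop the apostrophe when punctuation is not wanted.
        prefix = prefix[:-1]
    return [prefix, rest]


print(split_elision("l'heure"))
print(split_elision("l'arc"))
print(split_elision("l'heure", include_punctuation=True))
```

Because the pattern does not look at which letter follows the apostrophe, "l'heure" and "l'arc" are handled the same way, which is the point of the fix. A word such as "aujourd'hui" is left alone here only because its prefix is longer than one letter.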