rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Fix regex's inconsistent word breaking around apostrophes #77

Closed rspeer closed 4 years ago

rspeer commented 4 years ago

Relaxing the dependency on regex had an unintended consequence in 2.3.1: wordfreq could no longer get the frequency of French phrases such as "l'écran", because the way regex tokenizes them differs between versions.

Fix this with a more complex tokenization rule that should handle apostrophes the same way across the various versions of regex that are now allowed.
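
To illustrate the kind of rule involved, here is a minimal sketch that makes the apostrophe handling explicit instead of relying on the library's default word-break behavior. It is not the expression from tokens.py: it uses the standard-library `re` module rather than regex, and the name `sketch_tokenize`, the vowel set, and the one-or-two-letter prefix limit are all assumptions made for this example.

```python
import re

# Hypothetical sketch, not the rule in tokens.py.
# Assumption: an elided prefix is 1-2 letters, an apostrophe, then a vowel or "h".
TOKEN_RE = re.compile(
    r"""
    \b\w{1,2}['’](?=[aàâeéèêiîoôuùûyh])   # elided prefix such as "l'" before a vowel
    |
    \w+(?:['’]\w+)*                        # ordinary word, internal apostrophes kept
    """,
    re.VERBOSE | re.IGNORECASE,
)


def sketch_tokenize(text):
    """Split text into lowercase tokens, breaking after an elided prefix."""
    # Strip the trailing apostrophe left on a prefix token such as "l'".
    return [m.rstrip("'’").lower() for m in TOKEN_RE.findall(text)]


print(sketch_tokenize("l'écran"))
print(sketch_tokenize("can't"))
```

Because the break after the prefix is spelled out in the pattern itself, the result does not depend on where a particular library version decides a word boundary falls around an apostrophe.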

(I ran black so it could format these ugly expressions appropriately; there are some miscellaneous formatting changes to tokens.py that came along as a result.)