rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

'narrow no-break space' ("\u202f") is not recognized as a word boundary #78

Open LBeaudoux opened 4 years ago

LBeaudoux commented 4 years ago

Unlike the 'no-break space' ("\u00A0"), the 'narrow no-break space' ("\u202f") is not recognized as a word boundary.

tokenize("La vois-tu souvent\u202f?", "fr") returns ['la', 'vois', 'tu', 'souvent\u202f'] instead of ['la', 'vois', 'tu', 'souvent'].

This is a problem because in French, some punctuation marks ( ; : ! ? ) must be preceded by a non-breaking space, ideally a narrow one, separating them from the preceding word.

I suppose one solution would be to modify "TOKEN_RE" in the "tokens" module to take this case into account. Unless, of course, this would create undesirable effects in other languages. Another solution could be to replace "\u202f" by "\u00A0" when preprocessing French texts.
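For reference, here is a minimal sketch of the second option (preprocessing), not a fix to the library itself. The helper name `normalize_narrow_nbsp` is hypothetical and not part of wordfreq; it only swaps the characters, so the result can then be passed to `tokenize` as usual.

```python
# Hypothetical preprocessing helper (not part of wordfreq): swap the
# narrow no-break space for the ordinary no-break space, which the
# tokenizer already treats as a word boundary.
NARROW_NBSP = "\u202f"
NBSP = "\u00a0"


def normalize_narrow_nbsp(text: str) -> str:
    """Return text with every U+202F replaced by U+00A0."""
    return text.replace(NARROW_NBSP, NBSP)


sample = "La vois-tu souvent\u202f?"
cleaned = normalize_narrow_nbsp(sample)
# cleaned no longer contains U+202F; all other characters are unchanged.
```

With this in place, the call would become tokenize(normalize_narrow_nbsp(text), "fr"). I have only tried to reason about this for French, so whether the same substitution is safe for other languages is an open question.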

Thank you anyway for sharing this library, which for me is essential for identifying the rarest words in a text.