rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Inconsistent tokenization in Italian depending on the version of regex #79

Closed by rspeer 2 years ago

rspeer commented 4 years ago

We tried to standardize the tokenization of French words such as "l'heure" across different versions of mrab's regex module. That fix assumed the desired output is ["l'", 'heure'], and it recognizes the pattern as one or two letters, an apostrophe, and a vowel.

A similar pattern appears in Italian, but there it can have four letters before the apostrophe, as in "nell'obolo", so the French fix does not cover it.
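
To make the difference concrete, here is a minimal sketch using only the standard library's re module. It is not wordfreq's actual tokenizer, which is built on the third-party regex module. The helper name make_elision_splitter, the inclusion of 'h' alongside the ASCII vowels (so that "l'heure" matches), and the idea that Italian should split as ["nell'", 'obolo'] by analogy with French are all assumptions for illustration.

```python
import re


def make_elision_splitter(max_letters):
    """Build a splitter for a leading elided article (hypothetical helper).

    The pattern is: 1..max_letters letters, an apostrophe, then a lookahead
    for an ASCII vowel or 'h'. Including 'h' is an assumption of this sketch.
    """
    pattern = re.compile(
        r"^([^\W\d_]{1,%d}')(?=[aeiouh])" % max_letters, re.IGNORECASE
    )

    def split(word):
        match = pattern.match(word)
        if match is None:
            return [word]  # no elided prefix found, keep the word whole
        prefix = match.group(1)
        return [prefix, word[len(prefix):]]

    return split


split_french = make_elision_splitter(2)   # one or two letters
split_italian = make_elision_splitter(4)  # up to four letters (assumed)

print(split_french("l'heure"))      # ["l'", 'heure']
print(split_french("nell'obolo"))   # ["nell'obolo"]
print(split_italian("nell'obolo"))  # ["nell'", 'obolo']
```

With a two-letter limit, "nell'" never matches, so the word is left to whatever the underlying segmentation does with the apostrophe. That is the part that varies between regex versions.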

On regex 2020.4.4, we get this tokenization:

>>> import wordfreq
>>> wordfreq.tokenize("nell'obolo", 'it')
["nell'obolo"]

But on regex 2018.2.21, we get:

>>> wordfreq.tokenize("nell'obolo", 'it')
['nell', 'obolo']

This case should be standardized as well. Requiring one particular version of regex is not a good option, because it would probably cause dependency conflicts with other libraries such as spaCy.

rspeer commented 2 years ago

Closed now that we require regex 2021 or later.
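
For anyone who wants to confirm which side of that requirement an environment is on, a minimal sketch follows. It assumes regex keeps its date-style version strings (such as 2020.4.4), and the function names release_year and regex_is_recent are hypothetical, not part of wordfreq or regex.

```python
from importlib import metadata


def release_year(version):
    """Return the leading year of a date-style version such as '2020.4.4'."""
    return int(version.split(".")[0])


def regex_is_recent(minimum_year=2021):
    """Report whether the installed regex distribution is recent enough."""
    try:
        installed = metadata.version("regex")
    except metadata.PackageNotFoundError:
        return False  # regex is not installed in this environment
    return release_year(installed) >= minimum_year
```

Calling regex_is_recent() returns False for the two older versions mentioned in this issue and True for any 2021 or later release.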