We tried to standardize the tokenization of French words such as "l'heure" across different versions of mrab's regex module. The fix assumed that we want this to come out as ["l'", 'heure'], and to recognize this pattern as one or two letters, an apostrophe, and a vowel.
A similar pattern appears in Italian, but it can have four characters before the apostrophe.
On regex 2020.4.4, we get this tokenization:
But on regex 2018.2.21, we get:
This should be standardized as well. If we insist on a particular version of regex, we will probably cause conflicts with other libraries such as spacy.
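
To make the elision pattern described above concrete, here is a minimal sketch that does not depend on the installed regex version, because it uses only the standard library's re module. It is not the project's actual tokenizer. The names ELISION_FOLLOWERS, elision_re, and split_elision are hypothetical, the character set is an illustrative guess, and treating "h" like a vowel is an assumption made so that "l'heure" matches. Only the ASCII apostrophe is handled.

```python
import re

# Assumption: characters that may follow the apostrophe. 'h' is included
# so that "l'heure" is split; the accented letters are illustrative only.
ELISION_FOLLOWERS = "aeiouyhàâéèêëîïôùûœ"


def elision_re(max_prefix):
    """Build a pattern: 1..max_prefix letters, an apostrophe, then a
    lookahead for one of ELISION_FOLLOWERS (hypothetical helper)."""
    return re.compile(
        rf"^([^\W\d_]{{1,{max_prefix}}}')(?=[{ELISION_FOLLOWERS}])",
        re.IGNORECASE,
    )


def split_elision(word, max_prefix=2):
    """Split an elided prefix off a single word, or return it unchanged."""
    m = elision_re(max_prefix).match(word)
    if m is None:
        return [word]
    return [m.group(1), word[m.end():]]


print(split_elision("l'heure"))                  # French, default prefix of 1-2 letters
print(split_elision("won't"))                    # no elided prefix, left whole
print(split_elision("dell'anno", max_prefix=4))  # Italian, prefix of up to 4 letters
```

With max_prefix=2, a word like "dell'anno" is left whole, which is why the Italian case would need the longer prefix length if it is to be standardized the same way.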