This PR changes the big tokenization regex to handle cases where @ or @s appears at the end of a word. The regex now works around Unicode's default segmentation, which splits words at @, and treats this @ as a letter, because it is a way of writing gender-neutral word endings in Spanish, Portuguese, and particularly far-left Italian.
As an example, the text "l@s niñ@s" should be tokenized as ["l@s", "niñ@s"], not as ["l", "s", "niñ", "s"].
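To illustrate the behavior, here is a minimal sketch using Python's built-in `re` module. It is not the regex from this PR: the patterns and the names `BASELINE_RE`, `AT_ENDING_RE`, and `sketch_tokenize` are hypothetical, and the sketch only approximates word segmentation as "runs of letters". It assumes the input is NFC-normalized, so that "ñ" is a single code point.

```python
import re

# Hypothetical baseline: a token is a run of Unicode letters,
# so "@" always acts as a word break.
BASELINE_RE = re.compile(r"[^\W\d_]+")

# Hypothetical variant: after the letters, also accept "@" or "@s",
# but only when no further word character follows (i.e. at the end of the word).
AT_ENDING_RE = re.compile(r"[^\W\d_]+(?:@s?(?!\w))?")


def sketch_tokenize(text, pattern=AT_ENDING_RE):
    """Return the lowercased tokens of `text` matched by `pattern`."""
    return pattern.findall(text.lower())


print(sketch_tokenize("l@s niñ@s", BASELINE_RE))  # ['l', 's', 'niñ', 's']
print(sketch_tokenize("l@s niñ@s"))               # ['l@s', 'niñ@s']
print(sketch_tokenize("user@example.com"))        # ['user', 'example', 'com']
```

The negative lookahead is what restricts the special case to word endings: an @ in the middle of a string such as an email address is still treated as a break.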
The endings "x" and "xs" are becoming more common in Spanish for this purpose, but these are already tokenized correctly. On the other hand, only the "@" version is attested in Portuguese. This steered me away from my initial plan to replace "@" with "x" in these endings in a pre-processing step.
This version now includes the new data from exquisite-corpus, so it has the words with @ in them, as well as some cleaner data from ParaCrawl.