rspeer / wordfreq

Access a database of word frequencies, in various natural languages.

Recognize "@" in gender-neutral word endings as part of the token #60

Closed rspeer closed 6 years ago

rspeer commented 6 years ago

This PR changes the main tokenization regex to handle cases where "@" or "@s" appears at the end of a word. The regex now works around Unicode's default word segmentation and treats this "@" as a letter, because it is a way of writing gender-neutral word endings in Spanish, Portuguese, and particularly far-left Italian.

As an example, the text "l@s niñ@s" should be tokenized as ["l@s", "niñ@s"], not as ["l", "s", "niñ", "s"].
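
To make the intended behavior concrete, here is a minimal sketch using Python's standard `re` module. It is not the actual regex from this PR, which is considerably larger; the names `NAIVE_RE`, `AT_ENDING_RE`, `naive_tokenize`, and `tokenize_with_at` are hypothetical and exist only for illustration. The sketch assumes a token is a run of word characters, optionally followed by "@" or "@s" when nothing word-like comes directly after it.

```python
import re
import unicodedata

# Hypothetical simplified patterns, NOT wordfreq's real tokenization regex.
# NAIVE_RE stands in for default segmentation: "@" always splits a word.
NAIVE_RE = re.compile(r"\w+")

# AT_ENDING_RE additionally accepts "@" or "@s" as a word ending, but only
# when no further word character follows (so "user@example" is still split).
AT_ENDING_RE = re.compile(r"\w+(?:@s?(?!\w))?")


def naive_tokenize(text):
    """Split on anything that is not a word character."""
    return NAIVE_RE.findall(unicodedata.normalize("NFC", text))


def tokenize_with_at(text):
    """Like naive_tokenize, but keep a word-final "@" or "@s" in the token."""
    return AT_ENDING_RE.findall(unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    print(naive_tokenize("l@s niñ@s"))
    print(tokenize_with_at("l@s niñ@s"))
    print(tokenize_with_at("user@example.com"))
```

Under these assumptions, the first call reproduces the unwanted split `['l', 's', 'niñ', 's']`, the second yields `['l@s', 'niñ@s']`, and the third leaves an email-like string split at the "@", because a word character follows it. The negative lookahead is the design choice that restricts the special case to word endings.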

The endings "x" and "xs" are becoming more common in Spanish for this purpose, but these are already tokenized correctly. On the other hand, only the "@" version is attested in Portuguese. This steered me away from my initial plan to replace "@" with "x" in these endings in a pre-processing step.

This version now includes the new data from exquisite-corpus, so it has the words with @ in them, as well as some cleaner data from ParaCrawl.