jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Improve language filter #95

Open jowagner opened 2 years ago

jowagner commented 2 years ago

The results of Cui (2020), "Language Identification on Short Textual Data", suggest that large improvements in language identification can be made over langid.py. An improved language filter could be applied to unclean sources such as OSCAR and ParaCrawl.
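To make the idea of a pluggable filter concrete, here is a minimal sketch of how a line-level language filter could be kept independent of the identifier behind it. Everything in it is an assumption for illustration: the `filter_by_language` name, the `predict` callable returning a `(language, confidence)` pair, the `ga` target code and the `0.5` threshold are hypothetical and not taken from this repository's scripts.

```python
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical predictor interface: text -> (language code, confidence in [0, 1]).
# Any identifier could be wrapped to fit this shape.
Predictor = Callable[[str], Tuple[str, float]]


def filter_by_language(
    lines: Iterable[str],
    predict: Predictor,
    target_lang: str = "ga",       # assumed code for Irish
    min_confidence: float = 0.5,   # assumed threshold, needs tuning per source
) -> Iterator[str]:
    """Yield only the non-empty lines that `predict` assigns to `target_lang`."""
    for line in lines:
        text = line.strip()
        if not text:
            # Skip blank lines rather than asking the identifier about them.
            continue
        lang, confidence = predict(text)
        if lang == target_lang and confidence >= min_confidence:
            yield text
```

With this shape, swapping langid.py for another identifier would only mean passing a different `predict` function, so the filtering code for each corpus would not need to change.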

According to the OSCAR paper, the language identification used in OSCAR is the same as in the fastText pipeline of Grave et al. (2018), "Learning Word Vectors for 157 Languages". It uses character n-grams as features and outperforms langid.py on 2 of the 3 test sets considered.
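For readers unfamiliar with character n-gram features, the following is a rough sketch of how such counts can be extracted from a short text. It is only an illustration of the general idea, not fastText's actual feature pipeline; the `char_ngrams` name, the lowercasing, the space padding and the default n-gram range are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count character n-grams of length n_min..n_max in `text`.

    The text is lowercased and padded with one space on each side so that
    n-grams at the start and end of the text are distinguishable (an
    assumption of this sketch).
    """
    padded = f" {text.strip().lower()} "
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the padded text.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts
```

Counts like these could then be fed to any classifier. Accented characters such as "á" show up directly in the n-grams, which is one plausible reason why character-level features help on short texts.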

For more methods to be explored (or included in an ensemble-based approach), see