Improve language filter

The results of Cui 2020 Language Identification on Short Textual Data suggest that language identification can be improved substantially over langid.py. An improved language filter could be applied to noisy sources such as OSCAR and ParaCrawl.

According to the OSCAR paper, the language identification used in OSCAR is the same as in the fastText pipeline of Grave et al. 2018 Learning Word Vectors for 157 Languages. It uses character n-grams as features and outperforms langid.py on 2 of the 3 test sets considered.
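To make "character n-grams as features" concrete, here is a minimal sketch in plain Python. It is not the fastText implementation (which, as far as I understand, trains a linear classifier on hashed n-gram features); it only counts n-grams and compares them to per-language profiles by cosine similarity. The function names, the n-gram range and the two toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max (text is lowercased and space-padded)."""
    padded = f" {text.lower().strip()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    feats = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(feats, profiles[lang]))


# Toy profiles built from one sentence each (hypothetical data, far too
# small for real use; a real profile would come from a large corpus).
profiles = {
    "ga": char_ngrams("tá an aimsir go maith inniu agus beidh sé go breá amárach"),
    "en": char_ngrams("the weather is good today and it will be fine tomorrow"),
}

print(identify("the weather today", profiles))
print(identify("tá sé go maith", profiles))
```

Even this toy version shows why short inputs are hard: a short line yields few n-grams, so the similarity scores rest on very little evidence.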

For more methods to explore (or to include in an ensemble-based approach), see:

Grothe et al. 2008 A Comparative Study on Language Identification Methods
Garg et al. 2014 A Survey of Language Identification Techniques and Applications
Jauhiainen et al. 2019 Automatic language identification in texts: a survey
Omayio et al. 2021 Language-Based Text Categorization: A Survey
Shared task "Discriminating between Similar Languages", e.g. 3rd edition Malmasi et al. 2016 Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task
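
A rough sketch of what an ensemble-based filter could look like, assuming each identifier is wrapped as a function that maps a line of text to a language code. The three identifiers below are deliberately crude, hypothetical stand-ins; in practice they would wrap real tools such as langid.py or the fastText model. The target language `ga`, the function names and the `min_agreement` threshold are also assumptions.

```python
from collections import Counter


def ensemble_language(text, identifiers):
    """Majority vote: return (language, share of identifiers that voted for it)."""
    votes = Counter(identify(text) for identify in identifiers)
    lang, count = votes.most_common(1)[0]
    return lang, count / len(identifiers)


def keep_line(text, identifiers, target="ga", min_agreement=0.5):
    """Keep a line only if a strict majority of identifiers agree on the target language."""
    lang, share = ensemble_language(text, identifiers)
    return lang == target and share > min_agreement


# Hypothetical toy identifiers, for illustration only.
def by_fada(text):
    # Accented vowels are taken as a hint for Irish.
    return "ga" if any(ch in "áéíóú" for ch in text.lower()) else "en"


def by_stopword(text):
    # A handful of frequent Irish function words.
    return "ga" if set(text.lower().split()) & {"agus", "an", "tá", "go"} else "en"


def by_the(text):
    # Presence of "the" is taken as a hint for English.
    return "en" if "the" in text.lower().split() else "ga"


identifiers = [by_fada, by_stopword, by_the]

lines = ["tá an aimsir go maith", "the weather is good", "an apple a day"]
kept = [line for line in lines if keep_line(line, identifiers)]
print(kept)
```

With these toy identifiers the English line "an apple a day" is wrongly kept, because two of the three weak voters are fooled by it. The vote is only as good as its members, so raising `min_agreement` or adding stronger identifiers would be the obvious knobs to try.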
