Open iBibek opened 8 months ago
Hi @iBibek
Thanks for your question. It is very likely that the dataset also contains text in other languages. The language of each document is identified with a fastText classifier, and any document with a score >= 0.5 is treated as being in the respective language and kept in the corpus. I would expect documents with lower language scores to be more likely to contain text in other languages, so if you want to filter such instances out, you can filter the dataset with a higher language score threshold (e.g., RefinedWeb uses 0.6, C4 goes as high as 0.99).
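As a rough illustration of that kind of filtering, here is a minimal sketch in plain Python. It assumes each document is a dict carrying its classifier score under a key called `language_score`; that key name, the helper `filter_by_language_score`, and the sample records are hypothetical, so adapt them to however the score is actually stored in your copy of the data.

```python
from typing import Any, Dict, Iterable, Iterator


def filter_by_language_score(
    docs: Iterable[Dict[str, Any]],
    min_score: float = 0.6,
    score_key: str = "language_score",  # hypothetical field name
) -> Iterator[Dict[str, Any]]:
    """Yield only the documents whose language score is at least min_score."""
    for doc in docs:
        score = doc.get(score_key)
        # Documents without a score are dropped rather than assumed clean.
        if score is not None and score >= min_score:
            yield doc


# Made-up sample records, for illustration only.
sample_docs = [
    {"text": "The quick brown fox jumps over the lazy dog.", "language_score": 0.97},
    {"text": "Bonjour, hello, hola.", "language_score": 0.55},
    {"text": "A record with no score attached."},
]

kept = list(filter_by_language_score(sample_docs, min_score=0.6))
```

Raising `min_score` trades recall for purity: a stricter threshold removes more mixed-language documents but also discards some valid ones, such as short texts that the classifier tends to score lower.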
I hope this helps!
The repo mentions that the dataset is composed of only these languages: en, de, fr, and es. Is there any possibility of contamination with other languages in the dataset? I would greatly appreciate your response. Thank you in advance.