Potential Language Contamination Inquiry

togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.

Apache License 2.0


Hi @iBibek

Thanks for your question. It is very likely that the dataset also contains text in other languages. The language of each document is identified with a FastText classifier, and any document with a score >= 0.5 for a language is treated as being in that language and kept in the corpus. I would expect documents with lower language scores to be more likely to contain text in other languages, so if you want to filter such instances out, you can filter the dataset with a higher language-score threshold (e.g., RefinedWeb uses 0.6, C4 goes as high as 0.99).
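As a rough illustration of that kind of filtering, here is a minimal sketch in Python. It assumes each document is a dict that carries its classifier score under a key called `language_score`; that key name, the `filter_by_language_score` helper, and the sample documents are all made up for this example and are not the actual RedPajama-Data schema or API.

```python
from typing import Dict, Iterable, Iterator


def filter_by_language_score(
    docs: Iterable[Dict],
    min_score: float = 0.6,
    score_key: str = "language_score",  # hypothetical field name, adjust to your data
) -> Iterator[Dict]:
    """Yield only the documents whose language score is at least min_score."""
    for doc in docs:
        score = doc.get(score_key)
        # Documents without a score are dropped rather than guessed at.
        if score is not None and score >= min_score:
            yield doc


# Made-up sample records, only to show the effect of the threshold.
docs = [
    {"text": "The quick brown fox.", "language_score": 0.98},
    {"text": "Hello, bonjour, hola.", "language_score": 0.55},
    {"text": "Mostly English, etwas Deutsch.", "language_score": 0.72},
]

kept = list(filter_by_language_score(docs, min_score=0.6))
print(len(kept), "of", len(docs), "documents kept")
```

Raising `min_score` trades recall for purity: a stricter threshold removes more mixed-language documents but also discards some valid ones, so it is worth checking how many documents survive before settling on a value.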

I hope this helps!


Potential Language Contamination Inquiry #108