togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0
4.53k stars 346 forks

Potential Language Contamination Inquiry #108

Open iBibek opened 6 months ago

iBibek commented 6 months ago

The repo mentions that the dataset is composed of only these languages: en, de, fr, and es. Is there any possibility of contamination with other languages in the dataset? I would greatly appreciate your response. Thank you in advance.

mauriceweber commented 6 months ago

Hi @iBibek

Thanks for your question. It is very likely that other languages are also present in the dataset. The language of each document is identified with a FastText classifier, and any document with a score >= 0.5 for one of the target languages is treated as being in that language and kept in the corpus. I would expect documents with lower language scores to be more likely to contain text in other languages, so if you want to filter such instances out, you can filter the dataset with a higher language-score threshold (e.g., RefinedWeb uses 0.6, C4 goes as high as 0.99).
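
A minimal sketch of that kind of filtering, assuming each document is available as a Python dict that carries its classifier score under a hypothetical `language_score` key (the field name, the helper `filter_by_language_score`, and the sample records are illustrative and not taken from the repository):

```python
from typing import Any, Dict, Iterable, Iterator

Document = Dict[str, Any]


def filter_by_language_score(
    docs: Iterable[Document],
    min_score: float = 0.5,
    score_key: str = "language_score",  # hypothetical field name
) -> Iterator[Document]:
    """Yield only documents whose language score is at least min_score.

    Documents without a score are dropped, since their language
    cannot be verified.
    """
    for doc in docs:
        score = doc.get(score_key)
        if score is not None and score >= min_score:
            yield doc


# Made-up sample records for illustration only.
docs = [
    {"text": "The quick brown fox.", "language_score": 0.98},
    {"text": "Mixed text avec du francais.", "language_score": 0.55},
    {"text": "No score recorded."},
]

# Raising the threshold from 0.5 to 0.6 drops the mixed-language record.
kept = list(filter_by_language_score(docs, min_score=0.6))
print([d["text"] for d in kept])
```

A stricter threshold trades corpus size for language purity, so it may be worth checking how many documents survive at a few different values before settling on one.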

I hope this helps!