togethercomputer / RedPajama-Data

The RedPajama-Data repository contains code for preparing large datasets for training large language models.
Apache License 2.0

Language diversity #46

Closed averkij closed 1 year ago

averkij commented 1 year ago

Hello. Thank you for your work.

Could you please provide information about the languages in RedPajama, or is it English only? I've downloaded the Common Crawl part, but I don't see a language field in the metadata. A minimal sketch of how I'm inspecting the shards is below.
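Here is a minimal sketch of that check, assuming a downloaded `.jsonl` shard; the file name is hypothetical, and the record layout is just what the script prints, not assumed in advance:

```python
import json

# Minimal sketch: print the top-level keys (and nested "meta" keys, if any)
# of the first few records in a downloaded Common Crawl shard, to see
# whether a language field is present.
path = "cc_shard.jsonl"  # hypothetical file name; point at any downloaded shard

with open(path, "r", encoding="utf-8") as f:
    for i, line in enumerate(f):
        record = json.loads(line)
        print(sorted(record.keys()))
        if isinstance(record.get("meta"), dict):
            print("meta:", sorted(record["meta"].keys()))
        if i >= 4:  # only look at the first few records
            break
```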

mauriceweber commented 1 year ago

Hi @averkij! The CC part is English only (this gets filtered in the CCNet pipeline). There is some multilinguality in the Wikipedia split: you can check out the list of languages here. C4 has also been pre-filtered for English. We don't explicitly filter the other splits for English, so there might be some very rare non-English content in GitHub or arXiv.
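For anyone who wants to spot-check this locally, here is a minimal sketch using fastText language identification (the same kind of model the CCNet pipeline relies on). The `docs` list is a hypothetical placeholder for text pulled from a downloaded shard, and the `lid.176.bin` model must be downloaded separately from the fastText website:

```python
import fasttext

# Minimal sketch: identify the language of a few documents with the
# off-the-shelf fastText LID model.
# Download lid.176.bin from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

# Hypothetical sample of document texts, e.g. the text field of records
# read from a downloaded Common Crawl or GitHub shard.
docs = ["This is an example document pulled from one of the splits."]

for text in docs:
    # fastText's predict() expects a single line, so strip newlines first.
    labels, probs = model.predict(text.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    print(f"{lang}\t{probs[0]:.3f}")
```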

averkij commented 1 year ago

Got it. Thanks anyway!

vince62s commented 1 year ago

I understand the dataset itself contains the 20 languages, but I am wondering whether the training actually used all of them. In this blog post, https://www.together.xyz/blog/redpajama-models-v1, the token-count table at the very end lists "Wikipedia (en)". Can someone clarify whether the RedPajama 7B checkpoint trained on 800B tokens actually includes all 20 Wikipedia languages? Thanks.