Closed: averkij closed this issue 1 year ago
Hi @averkij! The cc part is English only (non-English content gets filtered out in the ccnet pipeline). There is some multilinguality in the Wikipedia split: you can check out the list of languages here. C4 has also been pre-filtered for English. We don't explicitly filter the other splits for English, so there might be some very rare non-English content in GitHub or arXiv.
Got it. Thanks anyway!
I understand the dataset itself contains the 20 languages, but I am wondering whether the training actually used all of them. In this blog post: https://www.together.xyz/blog/redpajama-models-v1, at the very end, "Wikipedia (en)" is mentioned in the number of tokens. Can someone clarify whether the RedPajama 7B 800B checkpoint was actually trained with the 20 Wikipedia languages? Thanks.
Hello. Thank you for your work.
Can you please provide information about the languages in RedPajama, or is it English only? I've downloaded the Common Crawl part, but I don't see a language field in the metadata.