microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

RoBERTa Corpus #181

Closed: stephencurry-web closed this issue 1 week ago

stephencurry-web commented 6 months ago

The RoBERTa corpus is a combination of multiple sources; was any form of filtering performed? The bookcorpus dataset alone has 74M rows, but I saw that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I would appreciate a detailed description, or, if possible, a public release of your RoBERTa training data. Thank you for your help.

t1101675 commented 6 months ago

We didn't perform any data filtering for the corpus. We constructed the data as follows (a sketch of the pipeline is shown after the list):

  1. Combine these sources.
  2. Shuffle the documents.
  3. Tokenize them into chunks of 512 tokens.
  4. Take the first 20M chunks for training (in practice, we stopped tokenizing once the tokenized data contained 20M chunks).
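
For a concrete picture, here is a minimal Python sketch of that combine/shuffle/tokenize/truncate pipeline. The tokenizer choice (HuggingFace's `roberta-base`), the source file names, and the helper functions are all assumptions for illustration; the maintainers' actual preprocessing scripts are not shown in this thread.

```python
import random
from transformers import AutoTokenizer

# Hypothetical source files (one document per line); the real sources
# combined by the authors are not listed in this thread.
SOURCES = ["bookcorpus.txt", "wikipedia.txt", "cc_news.txt"]
CHUNK_LEN = 512              # tokens per training chunk
TARGET_CHUNKS = 20_000_000   # stop once 20M chunks exist

tokenizer = AutoTokenizer.from_pretrained("roberta-base")


def load_documents(paths):
    """Step 1: combine all sources into one list of documents."""
    docs = []
    for path in paths:
        with open(path, encoding="utf-8") as f:
            docs.extend(line.strip() for line in f if line.strip())
    return docs


def build_chunks(docs):
    """Steps 2-4: shuffle documents, tokenize them into a running
    buffer, and emit fixed-length 512-token chunks until the target
    count is reached."""
    random.shuffle(docs)
    buffer, chunks = [], []
    for doc in docs:
        buffer.extend(tokenizer.encode(doc, add_special_tokens=False))
        while len(buffer) >= CHUNK_LEN:
            chunks.append(buffer[:CHUNK_LEN])
            buffer = buffer[CHUNK_LEN:]
            if len(chunks) >= TARGET_CHUNKS:
                # Stop tokenizing as soon as 20M chunks exist,
                # mirroring the early stop described above.
                return chunks
    return chunks


chunks = build_chunks(load_documents(SOURCES))
```

Note that chunks can cross document boundaries here; whether the authors packed documents this way or padded at boundaries is not specified in the thread.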