The RoBERTa corpus combines multiple sources: was any filtering applied?
The bookcorpus dataset alone has 74M rows, but I noticed that your RoBERTa folder is named 20M. May I ask what rules you used to filter the final data? I would appreciate a detailed description of the filtering, or, if possible, a public release of your RoBERTa training data. Thank you for your help.