huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

[Feature request] Add Toronto BookCorpus dataset #131

Closed jarednielsen closed 4 years ago

jarednielsen commented 4 years ago

I know the copyright/distribution of this one is complex, but it would be great to have! That, combined with the existing wikitext, would provide a complete dataset for pretraining models like BERT.
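To make the request concrete, here is a rough sketch of how the combination could look with `load_dataset` and `concatenate_datasets` from the current datasets API (the `bookcorpus` name is an assumption until the dataset actually exists, and the wikitext config name may differ):

```python
# Minimal sketch: the "bookcorpus" dataset name is hypothetical here,
# and the wikitext config name may differ from what ships.
from datasets import load_dataset, concatenate_datasets

books = load_dataset("bookcorpus", split="train")
wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# Both datasets expose a single "text" column, so they can be
# concatenated directly into one pretraining corpus.
pretraining_corpus = concatenate_datasets([books, wiki])
print(pretraining_corpus)
```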

richarddwang commented 4 years ago

As far as I understand, wikitext refers to WikiText-103 and WikiText-2, which were created by researchers at Salesforce and are mostly used for traditional language modeling.

You probably mean wikipedia, the dump from the Wikimedia Foundation.

I would also like to have Toronto BookCorpus! Though it involves copyright problems...
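To illustrate the wikitext vs. wikipedia distinction above, a quick sketch (the config names are from memory and may differ):

```python
# Sketch only: config names are assumptions and may not match the hub exactly.
from datasets import load_dataset

# WikiText-103 / WikiText-2: small curated LM benchmarks from Salesforce.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

# wikipedia: a full Wikimedia dump, far larger and better suited to pretraining.
wikipedia = load_dataset("wikipedia", "20200501.en", split="train")
```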

richarddwang commented 4 years ago

Hi @lhoestq, just a reminder that this is solved by #248. 😉