Closed jarednielsen closed 4 years ago
As far as I understand, `wikitext` refers to WikiText-103 and WikiText-2, which were created by researchers at Salesforce and are mostly used in traditional language modeling. You might want to say `wikipedia`, a dump from the Wikimedia Foundation.
Also, I would like to have the Toronto BookCorpus too! Though it involves copyright problems...
Hi, @lhoestq, just a reminder that this is solved by #248. 😉
I know the copyright/distribution of this one is complex, but it would be great to have! That, combined with the existing `wikitext`, would provide a complete dataset for pretraining models like BERT.