Datasets are not sampled in proportion to their size
Datasets we view as higher-quality are sampled more frequently
The WebText2, Books1, and Wikipedia datasets are sampled 2-3 times over the course of training.
(Relatively) lower-quality datasets like CommonCrawl and Books2 are sampled less than once during training (a toy sampling sketch follows below)
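Below is a minimal sketch of what such quality-weighted sampling could look like. The mixture weights are the approximate fractions reported in the GPT-3 paper (60% CommonCrawl, 22% WebText2, 8% Books1, 8% Books2, 3% Wikipedia); the sampler itself is a toy illustration, not the actual training pipeline.

```python
import random

# Approximate sampling weights from the GPT-3 paper. Because the weights are
# not proportional to dataset size, the small high-quality corpora are seen
# multiple times while CommonCrawl and Books2 are seen less than once over a
# full training run.
MIXTURE_WEIGHTS = {
    "CommonCrawl": 0.60,
    "WebText2":    0.22,
    "Books1":      0.08,
    "Books2":      0.08,
    "Wikipedia":   0.03,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_dataset(rng) for _ in range(10_000)]
    for name in MIXTURE_WEIGHTS:
        print(name, draws.count(name) / len(draws))
```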
High-quality datasets
An expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time and first described in [KMH+20]
Two internet-based books corpora (Books1 and Books2)
English-language Wikipedia.
Data preparation process
We downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora (a classifier sketch follows this list)
We performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting (a MinHash sketch also follows)
We also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity
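The first step implies some kind of quality classifier: score each CommonCrawl document by how similar it is to the high-quality reference corpora and keep only the high-scoring ones. The sketch below uses a hashed bag-of-words linear classifier from scikit-learn; the toy corpora, the threshold, and the classifier choice are assumptions for illustration, not OpenAI's actual filtering pipeline.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real corpora; in practice these would be millions of
# documents streamed from disk.
reference_docs = [
    "A well-edited article discussing the history of number theory.",
    "An in-depth explanation of how photosynthesis converts light into energy.",
]
commoncrawl_docs = [
    "BUY CHEAP watches now!!! click here best price free shipping",
    "A thoughtful blog post about training neural networks at scale.",
]

# Hashed bag-of-words features, so no vocabulary has to fit in memory.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(reference_docs + commoncrawl_docs)
y = [1] * len(reference_docs) + [0] * len(commoncrawl_docs)

# Linear classifier scoring "how reference-like is this document?"
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Keep CommonCrawl documents whose predicted probability of being
# reference-like exceeds a (hypothetical) threshold.
THRESHOLD = 0.5
scores = clf.predict_proba(vectorizer.transform(commoncrawl_docs))[:, 1]
kept = [doc for doc, s in zip(commoncrawl_docs, scores) if s >= THRESHOLD]
print(f"kept {len(kept)} of {len(commoncrawl_docs)} CommonCrawl documents")
```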
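The second step, fuzzy deduplication, is commonly implemented with MinHash signatures that approximate the Jaccard similarity between documents. The following is a self-contained, greedy sketch of that idea (production pipelines usually add locality-sensitive hashing so they never compare all pairs); the shingle size, hash count, and threshold are illustrative choices, not the values used for GPT-3.

```python
import hashlib

NUM_HASHES = 64  # more hash functions -> better Jaccard estimate

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(doc_shingles: set) -> list:
    """One minimum value per seeded hash function."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(4, "little")).digest(), "big")
            for s in doc_shingles))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

def fuzzy_dedup(docs: list, threshold: float = 0.8) -> list:
    """Greedily drop documents whose signature is too close to an already kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```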
Our approach motivates building as large and diverse a dataset as possible, in order to collect natural language demonstrations of tasks across as varied a range of domains and contexts as possible.
Common Crawl is large and diverse, but it has significant data quality issues.
To improve document quality, we scraped only web pages that have been curated/filtered by humans.
Specifically, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.
This can be thought of as a heuristic indicator of whether other users found the link interesting, educational, or just funny (a toy sketch of this filter follows below).
The resulting dataset, WebText, contains the text subset of these 45 million links.
We removed all Wikipedia documents from WebText, since Wikipedia is a common data source for other datasets and could complicate evaluation due to overlap between training data and test tasks.
All results presented in this paper use a preliminary version of WebText, which does not include links created after Dec 2017 and contains slightly over 8 million documents, for a total of 40 GB of text after de-duplication and some heuristic-based cleaning.
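The karma filter described above is simple to express in code. The sketch below assumes a hypothetical list of (url, karma) records extracted from Reddit submissions; the field names and data are made up for illustration.

```python
# Hypothetical records for Reddit submissions containing outbound links.
submissions = [
    {"url": "https://example.com/good-article", "karma": 57},
    {"url": "https://example.com/spammy-page", "karma": 1},
    {"url": "https://example.com/good-article", "karma": 12},  # duplicate link
]

KARMA_THRESHOLD = 3  # heuristic proxy for "other users found this link worthwhile"

def select_webtext_links(records):
    """Keep each outbound URL once if any submission of it reached the karma threshold."""
    kept = set()
    for rec in records:
        if rec["karma"] >= KARMA_THRESHOLD:
            kept.add(rec["url"])
    return kept

print(select_webtext_links(submissions))
```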
The paper does not clearly explain Books1 and Books2, so we can only speculate about their contents. Several articles offer interesting conjectures about what these corpora might contain.