Datasets are not sampled in proportion to their size
Datasets we view as higher-quality are sampled more frequently
The WebText2, Books1, and Wikipedia datasets are sampled 2-3 times over the course of training.
(Relatively) lower-quality datasets like CommonCrawl and Books2 are sampled less than once during training (a toy sampling sketch follows below)
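Below is a minimal sketch of what such quality-weighted sampling could look like. The mixture weights are the approximate fractions reported in the GPT-3 paper (60% CommonCrawl, 22% WebText2, 8% Books1, 8% Books2, 3% Wikipedia); the sampler itself is a toy illustration, not the actual training pipeline.

```python
import random

# Approximate sampling weights from the GPT-3 paper. Because the weights are
# not proportional to dataset size, the small high-quality corpora are seen
# multiple times while CommonCrawl and Books2 are seen less than once over a
# full training run.
MIXTURE_WEIGHTS = {
    "CommonCrawl": 0.60,
    "WebText2":    0.22,
    "Books1":      0.08,
    "Books2":      0.08,
    "Wikipedia":   0.03,
}

def sample_dataset(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    names = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_dataset(rng) for _ in range(10_000)]
    for name in MIXTURE_WEIGHTS:
        print(name, draws.count(name) / len(draws))
```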
High-quality datasets
An expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time and first described in [KMH+20]
Two internet-based books corpora (Books1 and Books2)
English-language Wikipedia.
Data preparation process
We downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora (a classifier sketch follows this list)
We performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting (a MinHash sketch also follows)
We also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity
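The first step implies some kind of quality classifier: score each CommonCrawl document by how similar it is to the high-quality reference corpora and keep only the high-scoring ones. The sketch below uses a hashed bag-of-words linear classifier from scikit-learn; the toy corpora, the threshold, and the classifier choice are assumptions for illustration, not OpenAI's actual filtering pipeline.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for the real corpora; in practice these would be millions of
# documents streamed from disk.
reference_docs = [
    "A well-edited article discussing the history of number theory.",
    "An in-depth explanation of how photosynthesis converts light into energy.",
]
commoncrawl_docs = [
    "BUY CHEAP watches now!!! click here best price free shipping",
    "A thoughtful blog post about training neural networks at scale.",
]

# Hashed bag-of-words features, so no vocabulary has to fit in memory.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(reference_docs + commoncrawl_docs)
y = [1] * len(reference_docs) + [0] * len(commoncrawl_docs)

# Linear classifier scoring "how reference-like is this document?"
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Keep CommonCrawl documents whose predicted probability of being
# reference-like exceeds a (hypothetical) threshold.
THRESHOLD = 0.5
scores = clf.predict_proba(vectorizer.transform(commoncrawl_docs))[:, 1]
kept = [doc for doc, s in zip(commoncrawl_docs, scores) if s >= THRESHOLD]
print(f"kept {len(kept)} of {len(commoncrawl_docs)} CommonCrawl documents")
```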
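The second step, fuzzy deduplication, is commonly implemented with MinHash signatures that approximate the Jaccard similarity between documents. The following is a self-contained, greedy sketch of that idea (production pipelines usually add locality-sensitive hashing so they never compare all pairs); the shingle size, hash count, and threshold are illustrative choices, not the values used for GPT-3.

```python
import hashlib

NUM_HASHES = 64  # more hash functions -> better Jaccard estimate

def shingles(text: str, n: int = 5) -> set:
    """Word n-gram shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

def minhash_signature(doc_shingles: set) -> list:
    """One minimum value per seeded hash function."""
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(4, "little")).digest(), "big")
            for s in doc_shingles))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

def fuzzy_dedup(docs: list, threshold: float = 0.8) -> list:
    """Greedily drop documents whose signature is too close to an already kept one."""
    kept, kept_sigs = [], []
    for doc in docs:
        sig = minhash_signature(shingles(doc))
        if all(estimated_jaccard(sig, other) < threshold for other in kept_sigs):
            kept.append(doc)
            kept_sigs.append(sig)
    return kept
```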
Our approach motivates building as large and diverse a dataset as possible, in order to collect natural language demonstrations of tasks across as varied a range of domains and contexts as possible.
Common Crawl is large and diverse, but it has significant data quality issues.
To improve document quality, we scraped only web pages that have been curated/filtered by humans.
Specifically, we scraped all outbound links from Reddit, a social media platform, which received at least 3 karma.
This can be thought of as a heuristic indicator of whether other users found the link interesting, educational, or just funny (a toy sketch of this filter follows below).
The resulting dataset, WebText, contains the text subset of these 45 million links.
We removed all Wikipedia documents from WebText, since Wikipedia is a common data source for other datasets and could complicate evaluation due to overlap between training data and test tasks.
All results presented in this paper use a preliminary version of WebText, which does not include links created after Dec 2017 and contains slightly over 8 million documents, for a total of 40 GB of text after de-duplication and some heuristic-based cleaning.
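The karma filter described above is simple to express in code. The sketch below assumes a hypothetical list of (url, karma) records extracted from Reddit submissions; the field names and data are made up for illustration.

```python
# Hypothetical records for Reddit submissions containing outbound links.
submissions = [
    {"url": "https://example.com/good-article", "karma": 57},
    {"url": "https://example.com/spammy-page", "karma": 1},
    {"url": "https://example.com/good-article", "karma": 12},  # duplicate link
]

KARMA_THRESHOLD = 3  # heuristic proxy for "other users found this link worthwhile"

def select_webtext_links(records):
    """Keep each outbound URL once if any submission of it reached the karma threshold."""
    kept = set()
    for rec in records:
        if rec["karma"] >= KARMA_THRESHOLD:
            kept.add(rec["url"])
    return kept

print(select_webtext_links(submissions))
```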
The paper does not clearly explain Books1 and Books2, so we can only speculate about their contents. Several articles offer interesting conjectures about what these corpora might contain.