kibitzing / awesome-llm-data

A repository of information about data used in training large language models (LLMs)

GPT Pre-training Data #3

Open kibitzing opened 3 months ago

kibitzing commented 3 months ago

GPT-3 data mix

(Screenshot: the GPT-3 training data mix table, Table 2.2 of the GPT-3 paper.)
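For reference, the mix reported in Table 2.2 of the GPT-3 paper (Brown et al., 2020) is:

| Dataset | Quantity (tokens) | Weight in training mix | Epochs elapsed when training for 300B tokens |
|---|---|---|---|
| Common Crawl (filtered) | 410 billion | 60% | 0.44 |
| WebText2 | 19 billion | 22% | 2.9 |
| Books1 | 12 billion | 8% | 1.9 |
| Books2 | 55 billion | 8% | 0.43 |
| Wikipedia | 3 billion | 3% | 3.4 |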

High-quality datasets

Data preparation process

  1. We downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora (see the filtering sketch after this list)
  2. We performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting (see the deduplication sketch below)
  3. We also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity
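
Appendix A of the GPT-3 paper describes step 1 in more detail: a logistic-regression classifier over hashed token features (Spark's tokenizer plus HashingTF) was trained with the curated corpora as positives and raw CommonCrawl as negatives, and a document was kept if `np.random.pareto(9) > 1 - document_score`. Below is a minimal Python sketch of that idea using scikit-learn instead of Spark; the corpus variables are placeholders, not the paper's actual data:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder corpora; the real pipeline streams these from disk.
high_quality_docs = ["sample from WebText", "sample from Wikipedia"]  # label 1
raw_cc_docs = ["sample from unfiltered CommonCrawl", "another CC page"]  # label 0

# Hashed bag-of-words features, roughly analogous to Spark's HashingTF.
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
X = vectorizer.transform(high_quality_docs + raw_cc_docs)
y = np.array([1] * len(high_quality_docs) + [0] * len(raw_cc_docs))
clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep_document(doc: str, alpha: float = 9.0) -> bool:
    """Stochastic keep rule from the paper: np.random.pareto(alpha) > 1 - score.

    High-scoring documents are almost always kept, but some low-scoring
    ones survive, which preserves diversity in the filtered corpus.
    """
    score = clf.predict_proba(vectorizer.transform([doc]))[0, 1]
    return np.random.pareto(alpha) > 1 - score
```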
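For step 2, the paper says only that fuzzy deduplication used Spark's MinHashLSH with 10 hashes. The sketch below illustrates the same MinHash-LSH idea with the `datasketch` library; the 0.8 Jaccard threshold and 128 permutations are illustrative choices, not the paper's settings:

```python
from datasketch import MinHash, MinHashLSH

def minhash(doc: str, num_perm: int = 128) -> MinHash:
    """MinHash signature over the document's unique whitespace tokens."""
    m = MinHash(num_perm=num_perm)
    for token in set(doc.split()):
        m.update(token.encode("utf-8"))
    return m

# Documents whose estimated Jaccard similarity exceeds the threshold
# land in the same LSH bucket and are treated as near-duplicates.
lsh = MinHashLSH(threshold=0.8, num_perm=128)

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return the keys of documents kept after fuzzy deduplication.

    Querying before inserting means the first occurrence of each
    near-duplicate cluster is kept and later ones are dropped.
    """
    kept = []
    for key, doc in docs.items():
        sig = minhash(doc)
        if lsh.query(sig):  # a near-duplicate was already kept
            continue
        lsh.insert(key, sig)
        kept.append(key)
    return kept
```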
kibitzing commented 3 months ago

Common Crawl

kibitzing commented 3 months ago

WebText 1 (from Language Models are Unsupervised Multitask Learners)

kibitzing commented 3 months ago

WebText 2 (from Scaling Laws for Neural Language Models)

kibitzing commented 3 months ago

Book Corpus

The GPT-3 paper does not clearly describe Books1 and Books2, so we can only speculate about them. Here are several articles that offer interesting conjectures.

  1. https://gregoreite.com/drilling-down-details-on-the-ai-training-datasets/
  2. https://towardsdatascience.com/dirty-secrets-of-bookcorpus-a-key-dataset-in-machine-learning-6ee2927e8650
  3. https://github.com/soskek/bookcorpus/issues/27

Other References: