kibitzing / awesome-llm-data

A repository of information about data used in training large language models (LLMs)

LLaMa 2 Pre-training data #1

Open kibitzing opened 5 months ago

kibitzing commented 5 months ago

What kind of data was used to train LLaMa 2?

kibitzing commented 5 months ago

LLaMa 2

  1. Pre-training data
    • overall quantity: 2 trillion tokens
    • breakdown by data source
    • data filtering method
kibitzing commented 5 months ago

From the Llama 2 paper:

"It is important to understand what is in the pretraining data both to increase transparency and to shed light on root causes of potential downstream issues, such as potential biases."

"We followed Meta’s standard privacy and legal review processes for each dataset used in training."

kibitzing commented 5 months ago

Demographic Representation

Pronouns

[Screenshot: pronoun frequencies in the pretraining data, from the Llama 2 paper]

Identities

[Screenshot: identity-term frequencies in the pretraining data, from the Llama 2 paper]
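
The paper reports how often gender pronouns and identity terms appear across pretraining documents. As a rough illustration of how such document-level frequencies can be computed, here is a minimal sketch; the pronoun lists and tokenization are simplifications I chose for the example, not the paper's exact setup:

```python
# Minimal sketch: count how many documents contain at least one pronoun from
# each group. Pronoun lists and tokenization are illustrative assumptions.
import re
from collections import Counter

PRONOUN_GROUPS = {
    "she": {"she", "her", "hers", "herself"},
    "he": {"he", "him", "his", "himself"},
    "they": {"they", "them", "their", "theirs", "themselves"},
}

def document_pronoun_counts(documents):
    """Return, per group, the number of documents containing any of its pronouns."""
    counts = Counter()
    for doc in documents:
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        for group, pronouns in PRONOUN_GROUPS.items():
            if tokens & pronouns:
                counts[group] += 1
    return counts

docs = [
    "She finished her thesis.",
    "He said they would meet him later.",
]
print(document_pronoun_counts(docs))  # Counter({'she': 1, 'he': 1, 'they': 1})
```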
kibitzing commented 5 months ago

Categories from “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset (the HolisticBias paper)

[Screenshot: HolisticBias descriptor categories]
kibitzing commented 5 months ago

Data Toxicity

[Screenshot: toxicity distribution of the pretraining data, from the Llama 2 paper]
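
The paper scores pretraining documents with a toxicity classifier (a HateBERT model fine-tuned on ToxiGen). Below is a rough sketch of how such scoring could look with the Hugging Face transformers pipeline; the checkpoint name is an assumed, publicly available example, not necessarily the classifier Meta used:

```python
# Rough sketch: score a document with a ToxiGen-style toxicity classifier.
# The checkpoint name is an assumption for illustration only.
from transformers import pipeline

toxicity_scorer = pipeline(
    "text-classification",
    model="tomh/toxigen_hatebert",  # assumed checkpoint name
)

def classify_toxicity(text: str):
    """Return (label, confidence) for a single document."""
    result = toxicity_scorer(text)[0]
    return result["label"], result["score"]

print(classify_toxicity("Have a wonderful day!"))
```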
kibitzing commented 5 months ago

Language identification

[Screenshot: language distribution of the pretraining data, from the Llama 2 paper]

https://fasttext.cc/docs/en/language-identification.html
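
Language identification here is done with fastText's pretrained language-ID model (linked above). A minimal sketch of running such a check over a document follows; the local model path and the 0.5 confidence threshold are shown for illustration:

```python
import fasttext

# Load the pretrained language-identification model from fasttext.cc
# (lid.176.bin downloaded locally; the path is an assumption).
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str, threshold: float = 0.5):
    """Return the predicted language code, or None below the confidence threshold."""
    # fastText's predict() expects single-line input.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return lang if float(probs[0]) >= threshold else None

print(detect_language("The quick brown fox jumps over the lazy dog."))  # -> "en"
```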

kibitzing commented 5 months ago

Pre-training data summary:

  1. They did not use data from Meta's products or services.
  2. They removed data from certain sites known to contain a high volume of personal information about private individuals (see the illustrative sketch below).
    • social networks such as Twitter or LinkedIn?
  3. Beyond that, they did not filter out any pre-training data (e.g., toxic content was not removed).
  4. Instead, they provided multiple analyses of the data, as in the comments above.
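
A purely illustrative sketch of the kind of source-level removal described in point 2, dropping documents whose source domain is on a blocklist; the domains and document schema here are hypothetical, not Meta's actual process:

```python
# Illustrative sketch only: drop documents whose source domain is on a
# blocklist of sites known to host a lot of personal information.
from urllib.parse import urlparse

PII_HEAVY_DOMAINS = {"twitter.com", "linkedin.com"}  # hypothetical examples

def keep_document(doc: dict) -> bool:
    """Return True if the document's source URL is not on the blocklist."""
    domain = urlparse(doc["url"]).netloc.lower()
    # Strip a leading "www." so "www.linkedin.com" matches "linkedin.com".
    domain = domain.removeprefix("www.")
    return domain not in PII_HEAVY_DOMAINS

corpus = [
    {"url": "https://en.wikipedia.org/wiki/Language_model", "text": "..."},
    {"url": "https://www.linkedin.com/in/some-profile", "text": "..."},
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1
```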