kibitzing / awesome-llm-data

A repository of information about data used in training large language models (LLMs)

LLaMa 2 Pre-training data #1

Open kibitzing opened 5 months ago

kibitzing commented 5 months ago

What kind of data was used to train LLaMa 2?

kibitzing commented 5 months ago

LLaMa 2

  1. Pre-training data
    • overall quantity: 2 trillion tokens
    • breakdown by data source
    • data filtering method
kibitzing commented 5 months ago

From the Llama 2 paper:

"It is important to understand what is in the pretraining data both to increase transparency and to shed light on root causes of potential downstream issues, such as potential biases."

"We followed Meta’s standard privacy and legal review processes for each dataset used in training."

kibitzing commented 5 months ago

Demographic Representation

Pronouns

[Screenshot: pronoun frequencies in the pretraining data, from the Llama 2 paper]

Identities

[Screenshot: identity-term frequencies in the pretraining data, from the Llama 2 paper]
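
The paper reports how often gender pronouns and identity terms appear across pretraining documents. As a rough illustration of how such document-level frequencies can be computed, here is a minimal sketch; the pronoun lists and tokenization are simplifications I chose for the example, not the paper's exact setup:

```python
# Minimal sketch: count how many documents contain at least one pronoun from
# each group. Pronoun lists and tokenization are illustrative assumptions.
import re
from collections import Counter

PRONOUN_GROUPS = {
    "she": {"she", "her", "hers", "herself"},
    "he": {"he", "him", "his", "himself"},
    "they": {"they", "them", "their", "theirs", "themselves"},
}

def document_pronoun_counts(documents):
    """Return, per group, the number of documents containing any of its pronouns."""
    counts = Counter()
    for doc in documents:
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        for group, pronouns in PRONOUN_GROUPS.items():
            if tokens & pronouns:
                counts[group] += 1
    return counts

docs = [
    "She finished her thesis.",
    "He said they would meet him later.",
]
print(document_pronoun_counts(docs))  # Counter({'she': 1, 'he': 1, 'they': 1})
```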
kibitzing commented 5 months ago

Categories from “I’m sorry to hear that”: Finding New Biases in Language Models with a Holistic Descriptor Dataset (the HolisticBias paper)

[Screenshot: HolisticBias descriptor categories]
kibitzing commented 5 months ago

Data Toxicity

[Screenshot: toxicity distribution of the pretraining data, from the Llama 2 paper]
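
The paper scores pretraining documents with a toxicity classifier (a HateBERT model fine-tuned on ToxiGen). Below is a rough sketch of how such scoring could look with the Hugging Face transformers pipeline; the checkpoint name is an assumed, publicly available example, not necessarily the classifier Meta used:

```python
# Rough sketch: score a document with a ToxiGen-style toxicity classifier.
# The checkpoint name is an assumption for illustration only.
from transformers import pipeline

toxicity_scorer = pipeline(
    "text-classification",
    model="tomh/toxigen_hatebert",  # assumed checkpoint name
)

def classify_toxicity(text: str):
    """Return (label, confidence) for a single document."""
    result = toxicity_scorer(text)[0]
    return result["label"], result["score"]

print(classify_toxicity("Have a wonderful day!"))
```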
kibitzing commented 5 months ago

Language identification

[Screenshot: language distribution of the pretraining data, from the Llama 2 paper]

https://fasttext.cc/docs/en/language-identification.html
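
Language identification here is done with fastText's pretrained language-ID model (linked above). A minimal sketch of running such a check over a document follows; the local model path and the 0.5 confidence threshold are shown for illustration:

```python
import fasttext

# Load the pretrained language-identification model from fasttext.cc
# (lid.176.bin downloaded locally; the path is an assumption).
model = fasttext.load_model("lid.176.bin")

def detect_language(text: str, threshold: float = 0.5):
    """Return the predicted language code, or None below the confidence threshold."""
    # fastText's predict() expects single-line input.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    lang = labels[0].replace("__label__", "")
    return lang if float(probs[0]) >= threshold else None

print(detect_language("The quick brown fox jumps over the lazy dog."))  # -> "en"
```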

kibitzing commented 5 months ago

Pre-training data summary:

  1. They did not use data from Meta's products or services.
  2. They removed data from certain sites known to contain a high volume of personal information about private individuals (see the illustrative sketch below).
    • social networks such as Twitter or LinkedIn?
  3. Beyond that, they did not filter out any pre-training data (e.g., toxic content was not removed).
  4. Instead, they provided multiple analyses of the data, as in the comments above.
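
A purely illustrative sketch of the kind of source-level removal described in point 2, dropping documents whose source domain is on a blocklist; the domains and document schema here are hypothetical, not Meta's actual process:

```python
# Illustrative sketch only: drop documents whose source domain is on a
# blocklist of sites known to host a lot of personal information.
from urllib.parse import urlparse

PII_HEAVY_DOMAINS = {"twitter.com", "linkedin.com"}  # hypothetical examples

def keep_document(doc: dict) -> bool:
    """Return True if the document's source URL is not on the blocklist."""
    domain = urlparse(doc["url"]).netloc.lower()
    # Strip a leading "www." so "www.linkedin.com" matches "linkedin.com".
    domain = domain.removeprefix("www.")
    return domain not in PII_HEAVY_DOMAINS

corpus = [
    {"url": "https://en.wikipedia.org/wiki/Language_model", "text": "..."},
    {"url": "https://www.linkedin.com/in/some-profile", "text": "..."},
]
filtered = [doc for doc in corpus if keep_document(doc)]
print(len(filtered))  # 1
```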