meta-llama / llama

Inference code for Llama models

Paper questions: Common Crawl processing questions #296

Open joshalbrecht opened 1 year ago

joshalbrecht commented 1 year ago

There are a few details missing from the paper that are needed to really understand what data was actually used to train LLaMA.

The paper notes:

We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline

However, the size of crawls within a year varies dramatically. Which crawls were actually used?

Also, CCNet applies a perplexity threshold. Was the default value of 340 used?
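For context on the question: CCNet scores each document with a language model trained on Wikipedia and can drop documents whose perplexity is too high. A minimal sketch of that kind of cutoff filter is below; the documents and perplexity values are hypothetical stand-ins, not real KenLM output, and the 340 cutoff is just the default value asked about above.

```python
# Sketch of a CCNet-style perplexity cutoff. The perplexity values here
# are illustrative stand-ins for real KenLM scores.

def filter_by_perplexity(docs, threshold=340.0):
    """Keep documents whose LM perplexity falls below the threshold."""
    return [d for d in docs if d["perplexity"] < threshold]

docs = [
    {"text": "clean encyclopedic prose", "perplexity": 120.5},
    {"text": "boilerplate nav links",    "perplexity": 910.0},
    {"text": "ordinary news article",    "perplexity": 330.2},
]

kept = filter_by_perplexity(docs)
```

Note that CCNet also supports bucketing documents into head/middle/tail quality tiers by perplexity percentile rather than applying a single hard cutoff, which is part of why the exact threshold matters for reproducing the dataset.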

Finally, the paper notes:

we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.

Approximately what % of pages were filtered out by this classifier?
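To make the question concrete: the filter the paper describes is a linear model over page features, trained to separate Wikipedia-reference pages from random pages, with pages classified as non-references discarded. A self-contained toy sketch of that setup is below, using plain logistic regression over bag-of-words counts; the documents, labels, and hyperparameters are all illustrative, not the paper's actual setup.

```python
import math
from collections import Counter

# Toy sketch of a Wikipedia-reference classifier: a linear model over
# bag-of-words features, trained with gradient descent. Everything here
# (data, labels, learning rate) is illustrative.

def featurize(text, vocab):
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def train_logreg(X, y, lr=0.5, epochs=200):
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of log loss w.r.t. the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

# Label 1 = "looks like a page Wikipedia would cite as a reference".
docs = [
    "cited journal article with references",
    "peer reviewed study cited by wikipedia",
    "buy cheap pills online now",
    "click here to win a prize",
]
labels = [1, 1, 0, 0]
vocab = sorted({w for d in docs for w in d.lower().split()})
X = [featurize(d, vocab) for d in docs]

w, b = train_logreg(X, labels)
kept = [d for d, x in zip(docs, X) if predict(w, b, x) == 1]
```

The fraction of pages such a classifier discards depends entirely on the decision threshold and training data, which is exactly why the question above matters for reproducing the corpus.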

kaolin commented 1 year ago

@glample