There are a few details missing from the paper that are needed to fully understand what data was actually used to train LLaMA.
The paper notes:
We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline
However, the size of crawls within a year varies dramatically. Which crawls were actually used?
Also, the CCNet pipeline applies a perplexity threshold. Was the default value of 340 used?
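For context, here is a minimal sketch of what perplexity-based filtering looks like in principle. The threshold value, the `perplexity` field, and the `keep_document` helper are assumptions for illustration only, not the actual cc_net implementation:

```python
# Hypothetical sketch of perplexity-based document filtering.
# The threshold (340) and document schema are assumptions for illustration;
# the real cc_net pipeline buckets documents by language-model perplexity.

PERPLEXITY_THRESHOLD = 340.0  # assumed default under discussion


def keep_document(doc: dict) -> bool:
    """Keep a document only if its LM perplexity is at or below the threshold."""
    return doc.get("perplexity", float("inf")) <= PERPLEXITY_THRESHOLD


docs = [
    {"url": "https://example.com/a", "perplexity": 120.5},
    {"url": "https://example.com/b", "perplexity": 812.3},
]
kept = [d for d in docs if keep_document(d)]
print(f"kept {len(kept)} of {len(docs)} documents")
```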
Finally, the paper notes:
we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.
Approximately what % of pages were filtered out by this classifier?
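To make the question concrete, below is a minimal sketch of the kind of linear reference classifier the quote describes, assuming bag-of-words features and scikit-learn; the paper does not specify the features, model, or decision threshold actually used:

```python
# Sketch of a linear "Wikipedia reference vs. random page" classifier.
# Feature choice, model, and threshold are assumptions; the paper only says
# a linear model was trained and pages not classified as references were discarded.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positive examples: pages used as references on Wikipedia.
# Negative examples: randomly sampled CommonCrawl pages.
reference_pages = ["example text of a page cited as a Wikipedia reference"]
random_pages = ["example text of a randomly sampled CommonCrawl page"]

X = reference_pages + random_pages
y = [1] * len(reference_pages) + [0] * len(random_pages)

clf = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)

# Discard pages not classified as reference-like.
candidate_pages = ["example text of a CommonCrawl page to score"]
kept = [
    page
    for page, label in zip(candidate_pages, clf.predict(candidate_pages))
    if label == 1
]
```

The percentage asked about above would be the fraction of `candidate_pages` dropped by this final filtering step.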