There are a few details missing from the paper that are needed to fully understand what data was actually used to train LLaMA.
The paper notes:
We preprocess five CommonCrawl dumps, ranging from 2017 to 2020, with the CCNet pipeline
However, the size of crawls within a year varies dramatically. Which crawls were actually used?
Also, the CCNet pipeline applies a perplexity threshold. Was the default value of 340 used?
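For context, here is a minimal sketch of what perplexity-based filtering looks like in principle. The threshold value, the `perplexity` field, and the `keep_document` helper are assumptions for illustration only, not the actual cc_net implementation:

```python
# Hypothetical sketch of perplexity-based document filtering.
# The threshold (340) and document schema are assumptions for illustration;
# the real cc_net pipeline buckets documents by language-model perplexity.

PERPLEXITY_THRESHOLD = 340.0  # assumed default under discussion


def keep_document(doc: dict) -> bool:
    """Keep a document only if its LM perplexity is at or below the threshold."""
    return doc.get("perplexity", float("inf")) <= PERPLEXITY_THRESHOLD


docs = [
    {"url": "https://example.com/a", "perplexity": 120.5},
    {"url": "https://example.com/b", "perplexity": 812.3},
]
kept = [d for d in docs if keep_document(d)]
print(f"kept {len(kept)} of {len(docs)} documents")
```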
Finally, the paper notes:
we trained a linear model to classify pages used as references in Wikipedia v.s. randomly sampled pages, and discarded pages not classified as references.
Approximately what % of pages were filtered out by this classifier?
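To make the question concrete, below is a minimal sketch of the kind of linear reference classifier the quote describes, assuming bag-of-words features and scikit-learn; the paper does not specify the features, model, or decision threshold actually used:

```python
# Sketch of a linear "Wikipedia reference vs. random page" classifier.
# Feature choice, model, and threshold are assumptions; the paper only says
# a linear model was trained and pages not classified as references were discarded.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Positive examples: pages used as references on Wikipedia.
# Negative examples: randomly sampled CommonCrawl pages.
reference_pages = ["example text of a page cited as a Wikipedia reference"]
random_pages = ["example text of a randomly sampled CommonCrawl page"]

X = reference_pages + random_pages
y = [1] * len(reference_pages) + [0] * len(random_pages)

clf = make_pipeline(
    HashingVectorizer(n_features=2**18, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)

# Discard pages not classified as reference-like.
candidate_pages = ["example text of a CommonCrawl page to score"]
kept = [
    page
    for page, label in zip(candidate_pages, clf.predict(candidate_pages))
    if label == 1
]
```

The percentage asked about above would be the fraction of `candidate_pages` dropped by this final filtering step.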