The C4 dataset is summarized as "A colossal, cleaned version of Common Crawl's web crawl corpus." I am confused why this dataset is used in addition to the Common Crawl dataset. Am I mistaken in the understanding that C4 overlaps completely with Common Crawl, and that using both introduces nothing but duplication?
From the paper: "C4 [15%]. During exploratory experiments, we observed that using diverse pre-processed CommonCrawl datasets improves performance. We thus included the publicly available C4 dataset (Raffel et al., 2020) in our data. The preprocessing of C4 also contains deduplication and language identification steps: the main difference with CCNet is the quality filtering, which mostly relies on heuristics such as presence of punctuation marks or the number of words and sentences in a webpage."
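To make those heuristics concrete, here is a rough sketch of what such filters could look like (illustrative only; the thresholds are invented and the real C4 pipeline differs in its exact rules):

def keep_line(line: str) -> bool:
    # C4-style heuristic: keep only lines ending in terminal punctuation.
    return line.strip().endswith((".", "!", "?", '"'))

def keep_page(text: str, min_words: int = 50, min_sentences: int = 3) -> bool:
    # Keep a page only if, after line filtering, enough words and
    # sentences remain. Both thresholds are made up for illustration.
    kept = " ".join(l for l in text.splitlines() if keep_line(l))
    words = len(kept.split())
    sentences = sum(kept.count(p) for p in ".!?")
    return words >= min_words and sentences >= min_sentences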
Thanks for the input. The passage you quoted seems to stem from the LLaMA whitepaper. The improvement they observed could have been a coincidence, though. Can we be sure that RedPajama is similar enough to the LLaMA dataset that the same findings can be applied here? Has anyone compared the perplexity of two networks trained with and without the C4 data?
Hi @codesoap! Thanks for your question. You're right, the quoted passage comes from the LLaMA paper -- when building the RP dataset, our goal was to replicate the training data used for LLaMA, so we tried our best to follow the recipe as closely as possible.
Regarding the overlap between C4 and the CC slices in RP: C4 is based on the April 2019 CC dump (see pp. 5-7 in https://arxiv.org/pdf/1910.10683.pdf), while the CC slices are based on different dumps (e.g., July for 2019). That said, there will definitely be some overlap, just not an excessive amount. You can check out the statistics here to get a feeling for how much the snapshots overlap: https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap.
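If you want to quantify this for specific snapshots yourself, the computation behind those overlap statistics is essentially a Jaccard index over URL sets. A minimal sketch (the file names are made up; in practice you would pull the URL lists from the Common Crawl URL index):

def load_urls(path):
    # Expects one URL per line.
    with open(path) as f:
        return {line.strip() for line in f}

# Hypothetical file names for two crawls' URL lists.
crawl_a = load_urls("urls_2019-18.txt")
crawl_b = load_urls("urls_2019-30.txt")
print(len(crawl_a & crawl_b) / len(crawl_a | crawl_b))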
Let me know if this helps!
Hi @mauriceweber, thanks for chiming in! I'm afraid I still have trouble understanding how the Common Crawl dataset works. Looking at the Common Crawl stats page you've linked, and assuming that the dates in the first graph there are in the format year-week (I think this is what this code confirms), I derive the following:

- Neighboring crawls (crawl_i and crawl_{i+1}) have a Jaccard similarity close to 0.
- Crawls two apart (crawl_i and crawl_{i+2}) have a Jaccard similarity of roughly 0.15.
These findings leave me very confused. I would have assumed that a newer crawl contains almost everything a previous crawl includes, since websites rarely go offline. Thus I would have expected a very high overlap between neighboring crawls (a Jaccard similarity of 0.4 or more, say).

As time goes on, I would have assumed that the Jaccard similarity slowly decreases, as new websites and webpages are included that did not exist in the old crawl.

I must be misunderstanding something fundamental about Common Crawl. Do you see where the error in my train of thought lies? Sorry if I'm going somewhat off topic here. If there is any documentation on Common Crawl that I'm missing, please feel free to just point me there instead.
I think CC uses a new seed of URLs in every crawl they run -- so each crawl is, in some sense, a random collection of visited URLs, which is why there is generally only limited overlap (also between neighbouring crawls).
My best guess about the observation that sim(crawl_i, crawl_{i+1}) ≈ 0 but sim(crawl_i, crawl_{i+2}) ≈ 0.15 is that they perform relatively strict deduplication of URLs between neighboring crawls, but not between crawls further apart? And then the similarity decreases again as time passes (more dead URLs, different content, etc.).
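A quick toy simulation of this guess (my sketch, not how CC actually works; 75,000 candidate URLs and crawls of 3,000 are arbitrary scaled-down numbers):

import random

# Toy model of the dedup hypothesis: each crawl samples 3,000 URLs from a
# pool of 75,000, but never revisits URLs from the immediately preceding
# crawl.
random.seed(0)
population = set(range(75_000))

crawls, previous = [], set()
for _ in range(3):
    crawl = set(random.sample(sorted(population - previous), 3_000))
    crawls.append(crawl)
    previous = crawl

def jaccard(a, b):
    return len(a & b) / len(a | b)

print(jaccard(crawls[0], crawls[1]))  # 0.0 by construction
print(jaccard(crawls[0], crawls[2]))  # ~0.02, i.e. chance level

This reproduces the near-zero similarity between neighbours, but it only yields chance-level (~0.02) similarity two crawls apart, so the observed ~0.15 would additionally require CC to preferentially revisit URLs it has seen before.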
This is also a useful resource you can check out: http://nlpl.eu/skeikampen23/nagel.230206.pdf
Thanks a ton, @mauriceweber! I couldn't find this information on commoncrawl.org or in the Wikipedia article, but the first slide of the PDF you linked mentions it: "sample crawls, not a comprehensive crawl". Now I finally understand.
If I read https://commoncrawl.github.io/cc-crawl-statistics/plots/crawlsize correctly, Common Crawl currently knows about ~75 billion URLs in total and includes ~3 billion of them in a single crawl. Using the following Python script, I tried to reproduce the Jaccard index between two crawls with randomly selected URLs:
import random

# Scaled-down model of Common Crawl: ~75 billion known URLs, ~3 billion
# per crawl (both divided by one million to keep the simulation fast).
population = range(75_000)
sample1 = set(random.sample(population, 3_000))
sample2 = set(random.sample(population, 3_000))
jaccard_index = len(sample1 & sample2) / len(sample1 | sample2)
print(jaccard_index)
Running this a few times, I get Jaccard indices around 0.02, which is roughly what we see for sim(crawl_i, crawl_{i+1}) at https://commoncrawl.github.io/cc-crawl-statistics/plots/crawloverlap. I'm still uncertain how the higher sim(crawl_i, crawl_{i+2}) values fit in here, but overall I now understand why using Common Crawl and C4 together in the RedPajama dataset doesn't introduce as much overlap as I previously thought.
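As a back-of-the-envelope check of that number (my arithmetic): with two independent samples of size n = 3,000 drawn from N = 75,000, the expected intersection is n^2/N = 120 URLs and the expected union is 2n - n^2/N = 5,880, so the expected Jaccard index is 120 / 5,880 ≈ 0.02, matching the simulation.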
Thanks again! I'm closing this issue.