Closed RicardoDominguez closed 7 months ago
Hi @RicardoDominguez, the SHA1 digest of each document is the warc-block-digest of the corresponding plaintext .wet document. In other words, it is computed from the plaintext extracted from the original HTML document, not from the text that remains after deduplication. The reason you see two different hashes is that the CCNet pipeline performs sharded, paragraph-level deduplication: it removes paragraphs from a document when they are repeated in other documents in the same shard.
In the example you showed, the two documents originate from the same plaintext .wet document, but some paragraphs were removed from one document and some from the other, because those paragraphs also appeared in the respective shards. For example, the paragraph "Art and Archaeology" is present in 2021-21/0115/en_middle.json.gz/12666 but not in 2021-25/2449/en_middle.json.gz/17916.
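To make the effect concrete, here is a minimal sketch in Python. It is not the CCNet implementation: the helper names, the sample text, the per-shard "seen" sets, and the use of a hex SHA1 over unnormalized paragraphs are all assumptions made for illustration. It only shows how two records can carry the same source digest while their remaining text, and therefore a naively recomputed SHA1, differs.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Hex SHA1 of a UTF-8 string (illustrative; not the exact WARC digest encoding)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def dedup_paragraphs(text: str, seen_in_shard: set) -> str:
    """Drop every paragraph whose hash already occurs elsewhere in the shard."""
    kept = [p for p in text.split("\n") if sha1_hex(p) not in seen_in_shard]
    return "\n".join(kept)


# Hypothetical plaintext of one source document, as it would appear before dedup.
source_text = "Art and Archaeology\nOpening hours\nContact us"

# The digest is taken from the original plaintext, before any paragraph is removed.
source_digest = sha1_hex(source_text)

# Hypothetical shards: each has already seen a different paragraph in other documents.
seen_in_shard_a = {sha1_hex("Opening hours")}
seen_in_shard_b = {sha1_hex("Art and Archaeology")}

doc_a = {"digest": source_digest,
         "raw_content": dedup_paragraphs(source_text, seen_in_shard_a)}
doc_b = {"digest": source_digest,
         "raw_content": dedup_paragraphs(source_text, seen_in_shard_b)}

same_digest = doc_a["digest"] == doc_b["digest"]
same_text_hash = sha1_hex(doc_a["raw_content"]) == sha1_hex(doc_b["raw_content"])
print(same_digest, same_text_hash)
```

Under these assumptions, both records keep the digest of the shared source text, but "Art and Archaeology" survives only in the first one, so hashing the remaining text gives two different values.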
Let me know if this clarifies things for you.
I see, thank you for the clarification 👍
How is the SHA1 digest for exact deduplication obtained?
In particular, it seems that some files marked as exact duplicates are not exact duplicates, for instance 2021-25/2449/en_middle.json.gz/17916 and 2021-21/0115/en_middle.json.gz/12666. Naively computing the SHA1 of these two documents indeed leads to two different digests, as