How is the SHA1 digest computed?

RicardoDominguez commented 7 months ago

How is the SHA1 digest for exact deduplication obtained?

In particular, it seems that some files marked as exact duplicates are not exact duplicates, for instance 2021-25/2449/en_middle.json.gz/17916 and 2021-21/0115/en_middle.json.gz/12666.

Naively computing the SHA1 of the raw_content of these two documents gives two different digests, even though their stored digest fields are identical:

import hashlib

# doc1 and doc2 are the two JSON records referenced above
print(doc1['digest'])  # output: sha1:IV4DNI4TBSMLL7GKU6VUNK4TEJHLVHLT
print(doc2['digest'])  # output: sha1:IV4DNI4TBSMLL7GKU6VUNK4TEJHLVHLT
print(hashlib.sha1(doc1['raw_content'].encode('utf-8')).hexdigest())  # output: 3f4575a021f02a637ac9b04a4a26ff3b2ca89150
print(hashlib.sha1(doc2['raw_content'].encode('utf-8')).hexdigest())  # output: 14ac252de7c9674beb0cb8b3678fa204fd4661fe

mauriceweber commented 7 months ago

Hi @RicardoDominguez, the SHA1 digest of a document is the warc-block-digest of the corresponding plaintext .wet document. In other words, it is computed from the plaintext extracted from the original HTML document. The reason you see two different hashes is that the CCNet pipeline performs sharded, paragraph-level deduplication: it removes paragraphs from a document when they are repeated in other documents in the same shard.
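
As a side note on the format: the stored digest is a 32-character Base32 string, while hexdigest() returns 40 hex characters, so the two cannot be compared directly even for identical input. Below is a minimal sketch of how a hex digest could be brought into the same form, assuming the usual WARC convention of "sha1:" followed by the Base32 encoding of the 20-byte SHA1. The helper names warc_style_sha1 and hex_to_warc_style are hypothetical and not part of this repository. Applying this to raw_content is still not expected to reproduce the stored digest, because raw_content has already had paragraphs removed.

```python
import base64
import hashlib


def warc_style_sha1(data: bytes) -> str:
    """Return 'sha1:' + Base32(SHA1(data)), the form assumed for WARC digest headers."""
    raw_digest = hashlib.sha1(data).digest()  # 20 raw bytes
    return "sha1:" + base64.b32encode(raw_digest).decode("ascii")


def hex_to_warc_style(hex_digest: str) -> str:
    """Re-encode a 40-character hex SHA1 digest in the same Base32 form."""
    return "sha1:" + base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")


# Illustrative input only; this is not the text of any real .wet record.
example = warc_style_sha1("some extracted plaintext".encode("utf-8"))
print(example)
```

With both values in the same encoding, any remaining mismatch comes from the content that was hashed rather than from the representation.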

In the example you showed, the two documents originate from the same plaintext .wet document, but some paragraphs were removed from one document and different ones from the other, because those paragraphs also appeared in other documents in the respective shards. For example, the paragraph Art and Archaeology is present in 2021-21/0115/en_middle.json.gz/12666 but not in 2021-25/2449/en_middle.json.gz/17916.
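
To make the effect concrete, here is a toy sketch of shard-local, paragraph-level deduplication. It is not the CCNet implementation: the function dedup_shard, the shard contents, and the rule "drop every paragraph that also occurs in another document of the same shard" are simplifying assumptions for illustration only.

```python
import hashlib
from collections import Counter


def dedup_shard(docs: dict[str, str]) -> dict[str, str]:
    """Toy rule: drop each paragraph that occurs in more than one document of the shard."""
    doc_counts = Counter()
    for text in docs.values():
        # Count each paragraph once per document it appears in.
        doc_counts.update(set(text.split("\n")))
    return {
        doc_id: "\n".join(p for p in text.split("\n") if doc_counts[p] == 1)
        for doc_id, text in docs.items()
    }


# The same source plaintext lands in two different shards (made-up content).
source = "Home\nArt and Archaeology\nContact us"
shard_a = {"doc": source, "other": "Art and Archaeology\nUnrelated text A"}
shard_b = {"doc": source, "other": "Contact us\nUnrelated text B"}

doc_a = dedup_shard(shard_a)["doc"]  # loses "Art and Archaeology"
doc_b = dedup_shard(shard_b)["doc"]  # loses "Contact us"

# The digest of the shared source is identical for both copies,
# but the digests of the deduplicated texts differ.
source_digest = hashlib.sha1(source.encode("utf-8")).hexdigest()
digest_a = hashlib.sha1(doc_a.encode("utf-8")).hexdigest()
digest_b = hashlib.sha1(doc_b.encode("utf-8")).hexdigest()
print(doc_a != doc_b, digest_a != digest_b)
```

In this toy setup both copies carry the same source digest, yet hashing their remaining text gives different values, which mirrors what you observed.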

Let me know if this clarifies things for you.

RicardoDominguez commented 7 months ago

I see, thank you for the clarification 👍

togethercomputer / RedPajama-Data

How is the SHA1 digest computed? #81