'SHA-256' is indeed the official name. But that's true for 'SHA-1' as well, yet the digest headers almost universally use the format sha1:foo
(since that's what's used in the specification's examples and other documents), so that doesn't mean much. I'd probably go with sha256:foo
if I were to implement writing such records, simply for consistency. On parsing, I'd keep it generic and support any capitalisation and an optional dash.
In general though, the digest format is very underspecified, which is understandable but a shame. Not only is the format of the digest algorithm label undefined, the same is true for the digest encoding: Base32 is the common one, but hexadecimal digests exist in the wild as well, for example.
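For illustration, a minimal parsing sketch along those lines (my own, not part of the spec or any existing library; the function name and normalisation rules are made up):

```python
import re

def parse_labelled_digest(value: str) -> tuple[str, str]:
    """Split a digest header value into (algorithm, digest-value).

    Accepts any capitalisation and an optional dash, so 'SHA-256:...',
    'sha-256:...' and 'sha256:...' all yield the algorithm label 'sha256'.
    """
    algorithm, sep, digest = value.partition(":")
    if not sep or not digest:
        raise ValueError(f"no algorithm label in digest value: {value!r}")
    algorithm = algorithm.strip().lower()
    # Drop the optional dash in 'sha-1'/'sha-256'-style labels, but keep the
    # dash that is part of names such as 'sha3-256' or 'sha512-256'.
    algorithm = re.sub(r"^sha-(?=\d)", "sha", algorithm)
    return algorithm, digest.strip()

# parse_labelled_digest("SHA-256:deadbeef...")  -> ("sha256", "deadbeef...")
# parse_labelled_digest("sha1:WAOWJ...")        -> ("sha1", "WAOWJ...")
```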
Observations of WARCs found in the wild (web searches, looking at various tools): sha-256: has been used by some tools, but they changed to using sha256:.

(Proposing some guidelines.)

Guidelines for writing WARC files: use the labels from the table below.

Guidelines for reading WARC files: accept any capitalisation and an optional dash (e.g. SHA-1: -> sha1:).
| Algorithm | Label | Compat. label | Popular encoding | Base16 length | Base32 length |
| --- | --- | --- | --- | --- | --- |
| MD5 | md5: | | lowercase Base16 | 32† | 32† |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 |
| SHA-224 | sha224: | sha-224: | | 56 | 48 |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 |
| SHA-384 | sha384: | sha-384: | | 96 | 80 |
| SHA-512 | sha512: | sha-512: | | 128 | 104 |
| SHA-512/224 | sha512-224: | | | 56 | 48 |
| SHA-512/256 | sha512-256: | | | 64 | 56 |
| SHA3-224 | sha3-224: | | | 56 | 48 |
| SHA3-256 | sha3-256: | | | 64 | 56 |
| SHA3-384 | sha3-384: | | | 96 | 80 |
| SHA3-512 | sha3-512: | | | 128 | 104 |
| BLAKE2s | blake2s: | | | 64 | 56 |
| BLAKE2b | blake2b: | | | 128 | 104 |
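To make the Label and encoding columns concrete, here is a small illustrative sketch (my own, not taken from the specification or any existing tool) that produces header values in the two most popular formats:

```python
import base64
import hashlib

def labelled_sha1(payload: bytes) -> str:
    # 'sha1:' followed by uppercase Base32, the encoding traditionally seen in WARCs (32 chars).
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

def labelled_sha256(payload: bytes) -> str:
    # 'sha256:' followed by lowercase Base16/hex, the popular SHA-256 encoding (64 chars).
    return "sha256:" + hashlib.sha256(payload).hexdigest()

payload = b"<html>...</html>"
print(labelled_sha1(payload))    # 'sha1:' + 32 uppercase Base32 characters
print(labelled_sha256(payload))  # 'sha256:' + 64 lowercase hex characters
```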
Although this thread is not strictly a thread for recommending hash algorithms, since I have been directed here by CC (Common Crawl) developers to contribute to the discussion, I shall do so.
It is my personal opinion that both speed and cryptographic properties must be considered. Many LLMs now rely on Common Crawl and its derivatives as an important upstream data provider, so we must consider the possibility that Common Crawl or other WARC producers become targets of "data supply chain" attacks. Any additional security properties should therefore be considered beneficial. However, processing cost is an important consideration as well. Under this assumption, I would like to recommend two different algorithm categories.
These algorithms aim for a Pareto optimum outside cryptographic security. That is not to say they are insecure; however, they have not been designed and validated according to cryptographic principles and likely will not undergo such scrutiny in the future. That said, for the reasons above, any algorithm with an actively known vulnerability should not be used.
There are other widely used hash functions, but this is the only one I could find that is fast enough, passes all SMHasher tests, and is portable (not just 32- and 64-bit x86 but also ARM).
I would like to recommend two algorithms, each serving a different function. SHA3-256 is a NIST-released algorithm, and BLAKE3 is a cryptographically secure algorithm that is very fast.
SHA-384: Within the SHA-2 family, SHA-256 and SHA-512 are vulnerable to length extension attacks, whereas SHA-384 has 128 bits of length extension resistance. Since the WARC format expects lengths to be included, this is somewhat ameliorated, but it cannot hurt to have another layer of defence if we are aiming for cryptographic security. Being a NIST algorithm, it is among the most portable.
BLAKE3: Among the non-NIST algorithms, BLAKE3 is one of the fastest. Since it is a single algorithm (as opposed to the multiple variants of BLAKE2), the choice is easier. It is anywhere between 6 and 15 times faster than the NIST algorithms. It has official Rust and C implementations and works on both x86 and ARM architectures.
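As a rough sketch of what these two recommendations look like in practice (assuming the third-party blake3 package from PyPI for the BLAKE3 part; the labels follow the guidelines table earlier in this thread, not any standard):

```python
import hashlib

import blake3  # third-party: pip install blake3 (the official bindings)

payload = b"<html>...</html>"

# SHA-384: NIST algorithm with length-extension resistance, available in hashlib.
print("sha384:" + hashlib.sha384(payload).hexdigest())  # 96 hex characters

# BLAKE3: fast cryptographic hash, single variant, default 256-bit output.
print("blake3:" + blake3.blake3(payload).hexdigest())   # 64 hex characters
```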
This thread doesn't yet mention SHA-512/256, which is very like SHA-384: it's a length-extension-resistant NIST algorithm based on truncating SHA-512, and has a similar security budget to SHA-384. Because it truncates to 256 bits instead of 384 the strings are two thirds as long, which I like for applications where humans see them.
A practical question related to the original issue would be what labels to use -- would sha512/256:
work or would the slash cause a problem?
(As a total sidetrack, I like the idea of length extension resistance since it can be had for free, so why not? But I don't know of a threat model where length extension is relevant to digests.)
I think SHA-512/256 is a good alternative too, but the practical question you raise would definitely have to be answered before any real usage. I didn't have a good answer that would definitely work downstream, so I just chose SHA-384.
As for what the threat model would be, I do not know. However, I do not want to find out the hard way, and it is already established that changing hashes is not easy due to legacy code and downstream dependencies. As such, maximalist security approaches are justified imo.
I agree about the precautionary principle!
It does strike me as worth writing down somewhere what the security model for digests is. I'm not qualified, but my starting place would be like:
This could be relevant because there are archive provenance efforts like wacz-auth or C2PA which do involve signatures. E.g.:
Just rambling on the weekend... :)
I mostly agree, but collision resistance might also be somewhat important, else bodies that are actually different might get deduped. For example, if there is a public web archival service that uses deduplication on the payload digest, an attacker could put one version of the content into the collection but then serve something different with the same hash on their website. Visitors would see the latter version, but the archival service would always return the former instead. (And this would not rely on things like detecting when the archival service accesses the attacker's website and returning a different response, which is of course always possible.)
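A toy sketch of why that matters for digest-based deduplication (entirely hypothetical logic, not taken from any existing archival tool):

```python
import hashlib

stored_payloads: dict[str, bytes] = {}  # payload digest -> first payload seen

def archive(payload: bytes) -> str:
    """Store a payload, deduplicating on its digest (as a 'revisit' record would)."""
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    # If two *different* payloads ever collide on this digest, the second one is
    # silently replaced by the first on replay -- hence the need for collision
    # resistance, not just preimage resistance.
    stored_payloads.setdefault(digest, payload)
    return digest

def replay(digest: str) -> bytes:
    return stored_payloads[digest]
```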
Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.
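For illustration, this is what repeated digest headers could look like if a future WARC version allowed them (purely hypothetical; this is not valid in WARC/1.0 or 1.1, where the header must not repeat):

```python
import base64
import hashlib

def digest_headers(payload: bytes) -> list[str]:
    """Hypothetical: one WARC-Payload-Digest line per algorithm."""
    return [
        "WARC-Payload-Digest: sha1:"
        + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii"),
        "WARC-Payload-Digest: sha256:" + hashlib.sha256(payload).hexdigest(),
        "WARC-Payload-Digest: sha3-256:" + hashlib.sha3_256(payload).hexdigest(),
    ]

for line in digest_headers(b"<html>...</html>"):
    print(line)
```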
(Side note, WARC/1.1 calls SHA-1 'a strong digest function'. It isn't, and it wasn't at the time of publication either. WARC/1.1 was published in August 2017. The SHAttered collision came out half a year earlier in February 2017, and the SHAppening feasible collision attack was out in October 2015.)
> would sha512/256: work or would the slash cause a problem?
Unfortunately / is disallowed in the algorithm name by the grammar, as it is part of separators. I've added SHA-512/256 as sha512-256: to the guidelines table.
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
token = 1*<any US-ASCII character except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
> I mostly agree, but collision resistance might also be somewhat important
Ah, good point. In the threat model, an adversary should not be able to add two documents to a web archive with the same digest.
@ato would it be possible to add another column describing cryptographic security?
I would hope some rows could be added, including xxh3 and blake3. Would I need to make a separate PR?
| Algorithm | Label | Compat. label | Typical encoding | Base16 length | Base32 length | Cryptographic |
| --- | --- | --- | --- | --- | --- | --- |
| MD5 | md5: | | lowercase Base16 | 32 | 32 | Yes, but broken. |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 | Yes, but broken. |
| XXH3 | xxh3: | | | 16 | | No, but really fast |
| XXH128 | xxh128: | | | 32 | | No, but really fast |
| SHA-224 | sha224: | sha-224: | | 56 | 48 | Yes |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 | Yes, but vulnerable to length extension |
| SHA-384 | sha384: | sha-384: | | 96 | 80 | Yes |
| SHA-512 | sha512: | sha-512: | | 128 | 104 | Yes, but vulnerable to length extension |
| SHA-512/224 | sha512-224: | | | 56 | 48 | Yes |
| SHA-512/256 | sha512-256: | | | 64 | 56 | Yes |
| SHA3-224 | sha3-224: | | | 56 | 48 | Yes |
| SHA3-256 | sha3-256: | | | 64 | 56 | Yes |
| SHA3-384 | sha3-384: | | | 96 | 80 | Yes |
| SHA3-512 | sha3-512: | | | 128 | 104 | Yes |
| BLAKE2s | blake2s: | | | 64 | 56 | Yes |
| BLAKE2b | blake2b: | | | 128 | 104 | Yes |
| BLAKE3 | blake3: | | | 64 | | Yes, and really fast |
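For the new non-cryptographic rows, a quick sketch of where the Base16 lengths come from (assuming the third-party xxhash package; the xxh3: and xxh128: labels are only the proposal from this table, not anything standardised):

```python
import xxhash  # third-party: pip install xxhash

payload = b"<html>...</html>"

print("xxh3:" + xxhash.xxh3_64(payload).hexdigest())     # 16 hex characters (64-bit hash)
print("xxh128:" + xxhash.xxh3_128(payload).hexdigest())  # 32 hex characters (128-bit hash)
```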
> Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.
I think allowing multiple digests would be the way to go.
First, it would make any attack considerably more difficult: if each hash is cryptographically secure, an attacker has to defeat all of them simultaneously rather than just one.
Secondly, it would provide a safe buffer for migration. Even if one of the hashes is found to have a security vulnerability, the existence of another allows security to be maintained while an alternative is implemented.
I really prefer cryptographic hashes because they help me understand how to use them correctly. Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.
I'm OK with having multiple digests although I think we should test a bunch of existing warc libraries to see how many break.
The standard does not currently allow repeating the digest headers, so I'm not sure that's relevant. A library that currently accepts repeated headers is not implementing the standard correctly. It could be allowed in a future WARC version, and then libraries implementing support for that version would have to also support the repetitions.
Most libraries aren't standard-conforming, so it's worth checking how much of a change this proposal will be.
> I really prefer cryptographic hashes because they help me understand how to use them correctly
The specification as it stands does not recommend any hash, so the choice is already up to the library developers. Common Crawl, for instance, uses SHA-1, which is neither fast nor cryptographically secure. I think it might be difficult for the standard to dictate a cryptographically secure hash, especially since such security depends on many factors, including the threat model. The specification could, however, mention that such qualities exist and should be considered by developers when selecting a hash.
> Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.
I am considering the viability of this right now. Whether it is a feasible choice, either independently or not, is not in the scope of this discussion (although I'd welcome discussing it further in another setting), but the fact remains that these digests exist in both the specification and individual downstream projects such as Common Crawl, which makes them a very enticing option.
I'm confused by your comment. I wasn't suggesting that the standard should recommend or dictate any hash. If you think my comment isn't in scope for the discussion, I'm happy to not participate.
I apologize for the confusion. It seems that I was the one to have opened up this issue into a wider discussion, so it seems unfair for me to now arbitrarily limit the scope of discussion. To address this, I will open a new PR and associated issue to discuss potential new community hash digest recommendations and any qualities (cryptographic security, speed) and functionalities (downstream dedup etc.) that such a hash must consider.
As I've said, I welcome such discussions, especially since I am working on such things right now. I merely wished to avoid this issue being diluted too much (which I must admit was my fault).
I would like to know the position on this. JWAT uses the algorithm specified in the digest header directly, so JWAT expects "SHA-256", since that seems to be the official name and the name supported by JCE. But now I see webrecorder uses "SHA256", which then fails. Maybe Python uses SHA256 instead of SHA-256?
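For what it's worth, a small sketch of how a reader could bridge the naming differences; the mapping tables here are my own illustrative subset (Python's hashlib uses names like "sha256", while JCE's MessageDigest.getInstance expects "SHA-256"):

```python
import hashlib

# Map normalised digest-header labels to the names the two libraries expect.
HASHLIB_NAMES = {"sha1": "sha1", "sha256": "sha256", "sha512": "sha512"}
JCE_NAMES = {"sha1": "SHA-1", "sha256": "SHA-256", "sha512": "SHA-512"}

def normalise_label(label: str) -> str:
    # 'SHA-256', 'SHA256', 'sha-256' -> 'sha256' (illustrative subset only)
    return label.strip().lower().replace("-", "")

label = "SHA-256"
print(hashlib.new(HASHLIB_NAMES[normalise_label(label)]).name)  # sha256
print(JCE_NAMES[normalise_label(label)])                        # SHA-256 (for JCE)
```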