iipc / warc-specifications

Centralised repository for WARC usage specifications.
http://iipc.github.io/warc-specifications/

SHA256 or SHA-256? #80

Open nclarkekb opened 2 years ago

nclarkekb commented 2 years ago

I would like to know the position on this. JWAT uses the algorithm name from the digest header directly, so it expects "SHA-256", since that seems to be the official name and is the name supported by JCE. But now I see that webrecorder uses "SHA256", which then fails. Maybe Python uses "SHA256" instead of "SHA-256"?

JustAnotherArchivist commented 2 years ago

'SHA-256' is indeed the official name. But that's true for 'SHA-1' as well, yet the digest headers almost universally use the format sha1:foo (since that's what's used in the specification's examples and other documents), so that doesn't mean much. I'd probably go with sha256:foo if I were to implement writing such records, simply for consistency. On parsing, I'd keep it generic and support any capitalisation and an optional dash.

In general though, the digest format is very underspecified, which is understandable but a shame. Not only is the format of the algorithm name undefined, the same is true for the digest encoding: Base32 is the most common, but hexadecimal digests exist in the wild as well.
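
A minimal sketch of that kind of lenient parsing in Python (the function names are illustrative, not from any existing library, and the encoding guess is only a heuristic):

```python
import string

def parse_labelled_digest(value: str):
    """Split e.g. 'SHA-256:2cf2...' into a normalised algorithm name and the raw digest."""
    algorithm, _, digest = value.partition(":")
    # Accept any capitalisation and an optional dash: SHA-256, sha-256, sha256, ...
    return algorithm.strip().lower().replace("-", ""), digest.strip()

def guess_encoding(digest: str) -> str:
    """Guess whether a digest value looks like Base16 or Base32 from its character set."""
    chars = set(digest.upper())
    if chars <= set(string.hexdigits.upper()):
        return "base16"
    if chars <= set("ABCDEFGHIJKLMNOPQRSTUVWXYZ234567="):
        return "base32"
    return "unknown"

print(parse_labelled_digest("SHA-256:2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"))
print(guess_encoding("C6GPBTJTTF6VVPBMZ2C7FHFAQOG3VHPM"))  # base32
```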

ato commented 2 years ago

Observations of WARCs found in the wild (web searches, looking at various tools):


(Proposing some guidelines.)

Guidelines for writing WARC files:

Guidelines for reading WARC files:

| Algorithm | Label | Compat. label | Popular encoding | Base16 length | Base32 length |
|---|---|---|---|---|---|
| MD5 | md5: | | lowercase Base16 | 32† | 32† |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 |
| SHA-224 | sha224: | sha-224: | | 56 | 48 |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 |
| SHA-384 | sha384: | sha-384: | | 96 | 80 |
| SHA-512 | sha512: | sha-512: | | 128 | 104 |
| SHA-512/224 | sha512-224: | | | 56 | 48 |
| SHA-512/256 | sha512-256: | | | 64 | 56 |
| SHA3-224 | sha3-224: | | | 56 | 48 |
| SHA3-256 | sha3-256: | | | 64 | 56 |
| SHA3-384 | sha3-384: | | | 96 | 80 |
| SHA3-512 | sha3-512: | | | 128 | 104 |
| BLAKE2s | blake2s: | | | 64 | 56 |
| BLAKE2b | blake2b: | | | 128 | 104 |
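
For what it's worth, a small sketch of writing digests following the "popular encoding" column above, using Python's hashlib and base64 (the helper name is made up):

```python
import base64
import hashlib

def labelled_digest(payload: bytes, algorithm: str = "sha256") -> str:
    """Format a digest value following the encodings observed in the table above:
    sha1 -> uppercase Base32, sha256 -> lowercase Base16 (hex)."""
    if algorithm == "sha1":
        return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")
    if algorithm == "sha256":
        return "sha256:" + hashlib.sha256(payload).hexdigest()
    raise ValueError(f"no convention chosen here for {algorithm!r}")

print(labelled_digest(b"example payload", "sha1"))    # sha1: followed by 32 Base32 characters
print(labelled_digest(b"example payload", "sha256"))  # sha256: followed by 64 hex characters
```
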
chris-ha458 commented 1 year ago

Although this thread is not strictly about recommending hash algorithms, since I have been directed here by Common Crawl developers to contribute to the discussion, I shall do so.

It is my personal opinion that both speed and cryptographic strength must be considered. Many LLMs now rely on Common Crawl and its derivatives as an important upstream data provider, so we must consider the possibility that Common Crawl or other WARC producers become targets of "data supply chain" attacks. As such, any additional security properties should be considered beneficial. However, processing cost is an important consideration as well. Under these assumptions, I would like to recommend two different algorithm categories.

Noncryptographic Algorithms

These algorithms target a Pareto optimum that excludes cryptographic security. That is not to say they are necessarily insecure; however, they have not been designed and validated according to cryptographic principles and will likely not undergo such scrutiny in the future. That being said, for the reasons above, any algorithm with actively known vulnerabilities should not be used.

There are other widely used hash functions, but XXH3 is the only one I could find that is fast enough, passes all SMHasher tests, and is portable (not just 32- and 64-bit x86 but also ARM).
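
For illustration, assuming the third-party xxhash Python package, using XXH3 might look like the sketch below (non-cryptographic, so only suitable against accidental corruption, not adversaries):

```python
# Assumes the third-party 'xxhash' package (pip install xxhash).
import xxhash

data = b"example record payload"

# 64-bit XXH3: 16 hex characters.
print("xxh3:" + xxhash.xxh3_64(data).hexdigest())

# 128-bit XXH3: 32 hex characters.
print("xxh128:" + xxhash.xxh3_128(data).hexdigest())
```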

Cryptographic Algorithms

I would like to recommend two algorithms, each serving a different function: SHA3-256 is a NIST-standardised algorithm, and BLAKE3 is a cryptographically secure algorithm that is very fast.
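
A sketch of what these look like from Python: SHA3-256 is in the standard hashlib, while BLAKE3 is assumed to come from the third-party blake3 package:

```python
import hashlib

# Assumes the third-party 'blake3' package (pip install blake3).
import blake3

data = b"example record payload"

# SHA3-256 (FIPS 202): 64 hex characters.
print("sha3-256:" + hashlib.sha3_256(data).hexdigest())

# BLAKE3: 32-byte default output, 64 hex characters, very fast.
print("blake3:" + blake3.blake3(data).hexdigest())
```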

jcushman commented 1 year ago

This thread doesn't yet mention SHA-512/256, which is very similar to SHA-384: it's a length-extension-resistant NIST algorithm based on truncating SHA-512, and it has a similar security budget to SHA-384. Because it truncates to 256 bits instead of 384, the digest strings are two thirds as long, which I like for applications where humans see them.
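
A quick illustration of the length difference, assuming a Python build whose OpenSSL exposes SHA-512/256 through hashlib:

```python
import hashlib

data = b"example record payload"

# SHA-512/256 is not in hashlib.algorithms_guaranteed; availability depends on OpenSSL.
if "sha512_256" in hashlib.algorithms_available:
    print(hashlib.new("sha512_256", data).hexdigest())  # 64 hex characters

print(hashlib.sha384(data).hexdigest())                 # 96 hex characters
```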

A practical question related to the original issue would be what labels to use -- would sha512/256: work or would the slash cause a problem?

(As a total sidetrack, I like the idea of length extension resistance since it can be had for free, so why not? But I don't know of a threat model where length extension is relevant to digests.)

chris-ha458 commented 1 year ago

I think SHA-512/256 is a good alternative too, but the practical question you raise would definitely have to be answered before any real usage. I didn't have a good answer that would definitely work downstream, so I just chose SHA-384.

As for what the threat model would be, I do not know. However, I do not want to find out the hard way, and it is already established that changing hashes is not easy due to legacy code and downstream dependencies. As such, maximalist security approaches are justified, in my opinion.

jcushman commented 1 year ago

I agree about the precautionary principle!

It does strike me as worth writing down somewhere what the security model for digests is. I'm not qualified, but my starting point would be something like:

This could be relevant because there are archive provenance efforts like wacz-auth or C2PA which do involve signatures. E.g.:

Just rambling on the weekend... :)

JustAnotherArchivist commented 1 year ago

I mostly agree, but collision resistance might also be somewhat important, else bodies that are actually different might get deduped. For example, if there is a public web archival service that uses deduplication on the payload digest, an attacker could put one version of the content into the collection but then serve something different with the same hash on their website. Visitors would see the latter version, but the archival service would always return the former instead. (And this would not rely on things like detecting when the archival service accesses the attacker's website and returning a different response, which is of course always possible.)
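
As a toy sketch of that deduplication mechanism (purely illustrative names; real archives are far more involved):

```python
import hashlib

# Toy dedup index: payload digest -> the payload that was stored first.
dedup_index: dict[str, bytes] = {}

def store(payload: bytes) -> bytes:
    """Store a payload, deduplicating on its digest, and return what the archive keeps."""
    key = "sha256:" + hashlib.sha256(payload).hexdigest()
    # If an attacker could craft two different payloads with the same digest,
    # the second submission would silently be deduplicated into the first.
    return dedup_index.setdefault(key, payload)

first = store(b"version the attacker seeds into the archive")
second = store(b"different version later served on the live site")
# Without a collision these keys differ; with a genuine collision,
# the second call would return the first payload instead.
```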

Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.
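
For illustration, a hypothetical record with repeated digest headers could look like the following (not valid under the current standard; the digest values are placeholders):

```
WARC/1.1
WARC-Type: response
WARC-Record-ID: <urn:uuid:...>
WARC-Payload-Digest: sha256:<64 hex characters>
WARC-Payload-Digest: sha3-256:<64 hex characters>
WARC-Block-Digest: sha256:<64 hex characters>
WARC-Block-Digest: sha3-256:<64 hex characters>
```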

(Side note, WARC/1.1 calls SHA-1 'a strong digest function'. It isn't, and it wasn't at the time of publication either. WARC/1.1 was published in August 2017. The SHAttered collision came out half a year earlier in February 2017, and the SHAppening feasible collision attack was out in October 2015.)

ato commented 1 year ago

> would sha512/256: work or would the slash cause a problem?

Unfortunately / is disallowed in the algorithm name, as the grammar excludes separator characters from tokens. I've added SHA-512/256 as sha512-256: to the guidelines table.

```
labelled-digest = algorithm ":" digest-value
algorithm       = token
digest-value    = token
token           = 1*<any US-ASCII character
                    except CTLs or separators>
separators      = "(" | ")" | "<" | ">" | "@"
                | "," | ";" | ":" | "\" | <">
                | "/" | "[" | "]" | "?" | "="
                | "{" | "}" | SP | HT
```
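
A small sketch of checking a proposed label against that token rule (the function name is illustrative):

```python
# Characters the grammar lists as separators, plus SP and HT.
_SEPARATORS = set('()<>@,;:\\"/[]?={} \t')

def is_valid_algorithm_label(label: str) -> bool:
    """True if 'label' is a non-empty token: US-ASCII, no CTLs, no separators."""
    if not label:
        return False
    for ch in label:
        code = ord(ch)
        if code <= 0x1F or code >= 0x7F:  # CTLs, DEL and non-US-ASCII
            return False
        if ch in _SEPARATORS:
            return False
    return True

print(is_valid_algorithm_label("sha512-256"))  # True
print(is_valid_algorithm_label("sha512/256"))  # False: '/' is a separator
```
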
jcushman commented 1 year ago

> I mostly agree, but collision resistance might also be somewhat important

Ah, good point. In the threat model, an adversary should not be able to add two documents to a web archive with the same digest.

chris-ha458 commented 1 year ago

@ato would it be possible to add another column describing cryptographic security?

I would also hope some rows could be added, including xxh3 and blake3.

Would I need to make a separate PR?

| Algorithm | Label | Compat. label | Typical encoding | Base16 length | Base32 length | Cryptographic |
|---|---|---|---|---|---|---|
| MD5 | md5: | | lowercase Base16 | 32 | 32 | Yes, but broken. |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 | Yes, but broken. |
| XXH3 | xxh3: | | | 16 | | No, but really fast |
| XXH128 | xxh128: | | | 32 | | No, but really fast |
| SHA-224 | sha224: | sha-224: | | 56 | 48 | Yes |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 | Yes, but vulnerable to length extension |
| SHA-384 | sha384: | sha-384: | | 96 | 80 | Yes |
| SHA-512 | sha512: | sha-512: | | 128 | 104 | Yes, but vulnerable to length extension |
| SHA-512/224 | sha512-224: | | | 56 | 48 | Yes |
| SHA-512/256 | sha512-256: | | | 64 | 56 | Yes |
| SHA3-224 | sha3-224: | | | 56 | 48 | Yes |
| SHA3-256 | sha3-256: | | | 64 | 56 | Yes |
| SHA3-384 | sha3-384: | | | 96 | 80 | Yes |
| SHA3-512 | sha3-512: | | | 128 | 104 | Yes |
| BLAKE2s | blake2s: | | | 64 | 56 | Yes |
| BLAKE2b | blake2b: | | | 128 | 104 | Yes |
| BLAKE3 | blake3: | | | 64 | | Yes, and really fast |

chris-ha458 commented 1 year ago

> Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.

I think allowing multiple digests would be the way to go.

First, it would make any attack exponentially more difficult. If each hash is cryptographically secure, the combination would provide at least the sum of their security in bits.

Secondly, it would provide a safe buffer for migration. Even if one of the hashes is deemed to have a security vulnerability, the existence of another would allow security to be maintained while an alternative is implemented.

wumpus commented 1 year ago

I really prefer cryptographic hashes because they help me understand how to use them correctly. Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.

I'm OK with having multiple digests, although I think we should test a bunch of existing WARC libraries to see how many break.

JustAnotherArchivist commented 1 year ago

The standard does not currently allow repeating the digest headers, so I'm not sure that's relevant. A library that currently accepts repeated headers is not implementing the standard correctly. It could be allowed in a future WARC version, and then libraries implementing support for that version would have to also support the repetitions.

wumpus commented 1 year ago

Most libraries aren't standard-conforming, so it's worth checking how much of a change this proposal will be.

chris-ha458 commented 1 year ago

> I really prefer cryptographic hashes because they help me understand how to use them correctly

The specification, as it stands, does not recommend any hash; the choice is therefore up to library developers. Common Crawl, for instance, is using SHA-1, which is neither fast nor cryptographically secure. I think it might be difficult for the standard to dictate a cryptographically secure hash, especially since such security properties depend on many factors, including threat models. The specification might, however, mention that such qualities exist and should be considered by developers when selecting a hash.

> Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.

I am considering the viability of this right now. Whether it is a feasible choice on its own or not is outside the scope of this discussion (although I'd welcome discussing it further in another setting), but the fact remains that these digests exist in both the specification and downstream projects such as Common Crawl, which makes them a very enticing option.

wumpus commented 1 year ago

I'm confused by your comment. I wasn't suggesting that the standard should recommend or dictate any hash. If you think my comment isn't in scope for the discussion, I'm happy to not participate.

chris-ha458 commented 1 year ago

I apologize for the confusion. It seems that I was the one who opened this issue up into a wider discussion, so it seems unfair for me to now arbitrarily limit its scope. To address this, I will open a new PR and an associated issue to discuss potential community hash digest recommendations and the qualities (cryptographic security, speed) and functionalities (downstream dedup, etc.) that such a hash must consider.

As I've said, I welcome such discussions, especially since I am working on such things right now. I merely wished to avoid this issue being diluted too much (which, I must admit, was my fault).