'SHA-256' is indeed the official name. But that's true for 'SHA-1' as well, yet the digest headers almost universally use the format sha1:foo
(since that's what's used in the specification's examples and other documents), so that doesn't mean much. I'd probably go with sha256:foo
if I were to implement writing such records, simply for consistency. On parsing, I'd keep it generic and support any capitalisation and an optional dash.
In general though, the digest format is very underspecified, which is understandable but a shame. Not only is the format of the digest algorithm label undefined, the same is true for the digest encoding: Base32 is the common one, but hexadecimal digests exist in the wild as well, for example.
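For illustration, a minimal parsing sketch along those lines (my own, not part of the spec or any existing library; the function name and normalisation rules are made up):

```python
import re

def parse_labelled_digest(value: str) -> tuple[str, str]:
    """Split a digest header value into (algorithm, digest-value).

    Accepts any capitalisation and an optional dash, so 'SHA-256:...',
    'sha-256:...' and 'sha256:...' all yield the algorithm label 'sha256'.
    """
    algorithm, sep, digest = value.partition(":")
    if not sep or not digest:
        raise ValueError(f"no algorithm label in digest value: {value!r}")
    algorithm = algorithm.strip().lower()
    # Drop the optional dash in 'sha-1'/'sha-256'-style labels, but keep the
    # dash that is part of names such as 'sha3-256' or 'sha512-256'.
    algorithm = re.sub(r"^sha-(?=\d)", "sha", algorithm)
    return algorithm, digest.strip()

# parse_labelled_digest("SHA-256:deadbeef...")  -> ("sha256", "deadbeef...")
# parse_labelled_digest("sha1:WAOWJ...")        -> ("sha1", "WAOWJ...")
```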
Observations of WARCs found in the wild (web searches, looking at various tools): sha-256: has been used by some tools, but they changed to using sha256:.

(Proposing some guidelines.)

Guidelines for writing WARC files: use the labels from the table below.

Guidelines for reading WARC files: accept any capitalisation and an optional dash (e.g. SHA-1: -> sha1:).
| Algorithm | Label | Compat. label | Popular encoding | Base16 length | Base32 length |
| --- | --- | --- | --- | --- | --- |
| MD5 | md5: | | lowercase Base16 | 32† | 32† |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 |
| SHA-224 | sha224: | sha-224: | | 56 | 48 |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 |
| SHA-384 | sha384: | sha-384: | | 96 | 80 |
| SHA-512 | sha512: | sha-512: | | 128 | 104 |
| SHA-512/224 | sha512-224: | | | 56 | 48 |
| SHA-512/256 | sha512-256: | | | 64 | 56 |
| SHA3-224 | sha3-224: | | | 56 | 48 |
| SHA3-256 | sha3-256: | | | 64 | 56 |
| SHA3-384 | sha3-384: | | | 96 | 80 |
| SHA3-512 | sha3-512: | | | 128 | 104 |
| BLAKE2s | blake2s: | | | 64 | 56 |
| BLAKE2b | blake2b: | | | 128 | 104 |
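To make the Label and encoding columns concrete, here is a small illustrative sketch (my own, not taken from the specification or any existing tool) that produces header values in the two most popular formats:

```python
import base64
import hashlib

def labelled_sha1(payload: bytes) -> str:
    # 'sha1:' followed by uppercase Base32, the encoding traditionally seen in WARCs (32 chars).
    return "sha1:" + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")

def labelled_sha256(payload: bytes) -> str:
    # 'sha256:' followed by lowercase Base16/hex, the popular SHA-256 encoding (64 chars).
    return "sha256:" + hashlib.sha256(payload).hexdigest()

payload = b"<html>...</html>"
print(labelled_sha1(payload))    # 'sha1:' + 32 uppercase Base32 characters
print(labelled_sha256(payload))  # 'sha256:' + 64 lowercase hex characters
```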
Although this thread is not strictly a thread for recommending hash algorithms, since I have been directed here by CC (Common Crawl) developers to contribute to the discussion, I shall do so.
It is my personal opinion that both speed and cryptographic properties must be considered. Many LLMs now rely on Common Crawl and its derivatives as an important upstream data provider, so we must consider the possibility that Common Crawl or other WARC producers become targets of "data supply chain" attacks. Any additional security properties should therefore be considered beneficial. However, processing cost is an important consideration as well. Under this assumption, I would like to recommend two different algorithm categories.
These algorithms aim for a Pareto optimum outside cryptographic security. That is not to say they are insecure; however, they have not been designed and validated according to cryptographic principles and likely will not undergo such scrutiny in the future. That said, for the reasons above, any algorithm with an actively known vulnerability should not be used.
There are other widely used hash functions, but this is the only one I could find that is fast enough, passes all SMHasher tests, and is portable (not just 32- and 64-bit x86 but also ARM).
I would like to recommend two algorithms, each serving a different function. SHA3-256 is a NIST-released algorithm, and BLAKE3 is a cryptographically secure algorithm that is very fast.
SHA-384: Within the SHA-2 family, SHA-256 and SHA-512 are vulnerable to length extension attacks, whereas SHA-384 has 128 bits of length extension resistance. Since the WARC format expects lengths to be included, this is somewhat ameliorated, but it cannot hurt to have another layer of defence if we are aiming for cryptographic security. Being a NIST algorithm, it is among the most portable.
BLAKE3: Among the non-NIST algorithms, BLAKE3 is one of the fastest. Since it is a single algorithm (as opposed to the multiple variants of BLAKE2), the choice is easier. It is anywhere between 6 and 15 times faster than the NIST algorithms. It has official Rust and C implementations and works on both x86 and ARM architectures.
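As a rough sketch of what these two recommendations look like in practice (assuming the third-party blake3 package from PyPI for the BLAKE3 part; the labels follow the guidelines table earlier in this thread, not any standard):

```python
import hashlib

import blake3  # third-party: pip install blake3 (the official bindings)

payload = b"<html>...</html>"

# SHA-384: NIST algorithm with length-extension resistance, available in hashlib.
print("sha384:" + hashlib.sha384(payload).hexdigest())  # 96 hex characters

# BLAKE3: fast cryptographic hash, single variant, default 256-bit output.
print("blake3:" + blake3.blake3(payload).hexdigest())   # 64 hex characters
```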
This thread doesn't yet mention SHA-512/256, which is very like SHA-384: it's a length-extension-resistant NIST algorithm based on truncating SHA-512, and has a similar security budget to SHA-384. Because it truncates to 256 bits instead of 384 the strings are two thirds as long, which I like for applications where humans see them.
A practical question related to the original issue would be what labels to use -- would sha512/256:
work or would the slash cause a problem?
(As a total sidetrack, I like the idea of length extension resistance since it can be had for free, so why not? But I don't know of a threat model where length extension is relevant to digests.)
I think SHA-512/256 is a good alternative too, but the practical question you raise would definitely have to be answered before any real usage. I didn't have a good answer that would definitely work downstream, so I just chose SHA-384.
As for what the threat model would be, I do not know. However, I do not want to find out the hard way, and it is already established that changing hashes is not easy due to legacy code and downstream dependencies. As such, maximalist security approaches are justified imo.
I agree about the precautionary principle!
It does strike me as worth writing down somewhere what the security model for digests is. I'm not qualified, but my starting place would be like:
This could be relevant because there are archive provenance efforts like wacz-auth or C2PA which do involve signatures. E.g.:
Just rambling on the weekend... :)
I mostly agree, but collision resistance might also be somewhat important, else bodies that are actually different might get deduped. For example, if there is a public web archival service that uses deduplication on the payload digest, an attacker could put one version of the content into the collection but then serve something different with the same hash on their website. Visitors would see the latter version, but the archival service would always return the former instead. (And this would not rely on things like detecting when the archival service accesses the attacker's website and returning a different response, which is of course always possible.)
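A toy sketch of why that matters for digest-based deduplication (entirely hypothetical logic, not taken from any existing archival tool):

```python
import hashlib

stored_payloads: dict[str, bytes] = {}  # payload digest -> first payload seen

def archive(payload: bytes) -> str:
    """Store a payload, deduplicating on its digest (as a 'revisit' record would)."""
    digest = "sha256:" + hashlib.sha256(payload).hexdigest()
    # If two *different* payloads ever collide on this digest, the second one is
    # silently replaced by the first on replay -- hence the need for collision
    # resistance, not just preimage resistance.
    stored_payloads.setdefault(digest, payload)
    return digest

def replay(digest: str) -> bytes:
    return stored_payloads[digest]
```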
Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.
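For illustration, this is what repeated digest headers could look like if a future WARC version allowed them (purely hypothetical; this is not valid in WARC/1.0 or 1.1, where the header must not repeat):

```python
import base64
import hashlib

def digest_headers(payload: bytes) -> list[str]:
    """Hypothetical: one WARC-Payload-Digest line per algorithm."""
    return [
        "WARC-Payload-Digest: sha1:"
        + base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii"),
        "WARC-Payload-Digest: sha256:" + hashlib.sha256(payload).hexdigest(),
        "WARC-Payload-Digest: sha3-256:" + hashlib.sha3_256(payload).hexdigest(),
    ]

for line in digest_headers(b"<html>...</html>"):
    print(line)
```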
(Side note, WARC/1.1 calls SHA-1 'a strong digest function'. It isn't, and it wasn't at the time of publication either. WARC/1.1 was published in August 2017. The SHAttered collision came out half a year earlier in February 2017, and the SHAppening feasible collision attack was out in October 2015.)
> would sha512/256: work or would the slash cause a problem?
Unfortunately / is disallowed in the algorithm name by the grammar, as it is part of separators. I've added SHA-512/256 as sha512-256: to the guidelines table.
labelled-digest = algorithm ":" digest-value
algorithm = token
digest-value = token
token = 1*<any US-ASCII character except CTLs or separators>
separators = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
> I mostly agree, but collision resistance might also be somewhat important
Ah, good point. In the threat model, an adversary should not be able to add two documents to a web archive with the same digest.
@ato would it be possible to add another column describing cryptographic security?
I would hope some rows could be added, including xxh3 and blake3. Would I need to make a separate PR?
| Algorithm | Label | Compat. label | Typical encoding | Base16 length | Base32 length | Cryptographic |
| --- | --- | --- | --- | --- | --- | --- |
| MD5 | md5: | | lowercase Base16 | 32 | 32 | Yes, but broken. |
| SHA-1 | sha1: | sha-1: | uppercase Base32 | 40 | 32 | Yes, but broken. |
| XXH3 | xxh3: | | | 16 | | No, but really fast |
| XXH128 | xxh128: | | | 32 | | No, but really fast |
| SHA-224 | sha224: | sha-224: | | 56 | 48 | Yes |
| SHA-256 | sha256: | sha-256: | lowercase Base16 | 64 | 56 | Yes, but vulnerable to length extension |
| SHA-384 | sha384: | sha-384: | | 96 | 80 | Yes |
| SHA-512 | sha512: | sha-512: | | 128 | 104 | Yes, but vulnerable to length extension |
| SHA-512/224 | sha512-224: | | | 56 | 48 | Yes |
| SHA-512/256 | sha512-256: | | | 64 | 56 | Yes |
| SHA3-224 | sha3-224: | | | 56 | 48 | Yes |
| SHA3-256 | sha3-256: | | | 64 | 56 | Yes |
| SHA3-384 | sha3-384: | | | 96 | 80 | Yes |
| SHA3-512 | sha3-512: | | | 128 | 104 | Yes |
| BLAKE2s | blake2s: | | | 64 | 56 | Yes |
| BLAKE2b | blake2b: | | | 128 | 104 | Yes |
| BLAKE3 | blake3: | | | 64 | | Yes, and really fast |
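For the new non-cryptographic rows, a quick sketch of where the Base16 lengths come from (assuming the third-party xxhash package; the xxh3: and xxh128: labels are only the proposal from this table, not anything standardised):

```python
import xxhash  # third-party: pip install xxhash

payload = b"<html>...</html>"

print("xxh3:" + xxhash.xxh3_64(payload).hexdigest())     # 16 hex characters (64-bit hash)
print("xxh128:" + xxhash.xxh3_128(payload).hexdigest())  # 32 hex characters (128-bit hash)
```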
> Would it be worth allowing the WARC-Block-Digest and WARC-Payload-Digest headers to be repeatable? Then it would be possible to include hashes with more than one algorithm if so desired. While preimage attacks are still practically impossible even for otherwise horribly broken algorithms like MD2, this would provide some additional resistance into the future if/when preimage resistance does get broken. And executing a preimage or collision attack on multiple algorithms at the same time would be more difficult than a single one, possibly even when there exist practical attacks on each individual algorithm.
I think allowing multiple digests would be the way to go.
First, it would make any attack considerably more difficult: if each hash is cryptographically secure, an attacker has to defeat all of them simultaneously rather than just one.
Secondly, it would provide a safe buffer for migration. Even if one of the hashes is found to have a security vulnerability, the existence of another allows security to be maintained while an alternative is implemented.
I really prefer cryptographic hashes because they help me understand how to use them correctly. Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.
I'm OK with having multiple digests although I think we should test a bunch of existing warc libraries to see how many break.
The standard does not currently allow repeating the digest headers, so I'm not sure that's relevant. A library that currently accepts repeated headers is not implementing the standard correctly. It could be allowed in a future WARC version, and then libraries implementing support for that version would have to also support the repetitions.
Most libraries aren't standard-conforming, so it's worth checking how much of a change this proposal will be.
> I really prefer cryptographic hashes because they help me understand how to use them correctly
The specification as it stands does not recommend any hash, so the choice is already up to the library developers. Common Crawl, for instance, uses SHA-1, which is neither fast nor cryptographically secure. I think it might be difficult for the standard to dictate a cryptographically secure hash, especially since such security depends on many factors, including the threat model. The specification could, however, mention that such qualities exist and should be considered by developers when selecting a hash.
> Currently I have not read about anyone trying to dedup Common Crawl using the existing digests. I've seen people dedup paragraphs, and that's how blekko deduped back in the day.
I am considering the viability of this right now. Whether it is a feasible choice, either independently or not, is not in the scope of this discussion (although I'd welcome discussing it further in another setting), but the fact remains that these digests exist in both the specification and individual downstream projects such as Common Crawl, which makes them a very enticing option.
I'm confused by your comment. I wasn't suggesting that the standard should recommend or dictate any hash. If you think my comment isn't in scope for the discussion, I'm happy to not participate.
I apologize for the confusion. It seems that I was the one to have opened up this issue into a wider discussion, so it seems unfair for me to now arbitrarily limit the scope of discussion. To address this, I will open a new PR and associated issue to discuss potential new community hash digest recommendations and any qualities (cryptographic security, speed) and functionalities (downstream dedup etc.) that such a hash must consider.
As I've said, I welcome such discussions, especially since I am working on such things right now. I merely wished to avoid this issue being diluted too much (which I must admit was my fault).
I would like to know the position on this. JWAT uses the algorithm specified in the digest header directly, so JWAT expects "SHA-256", since that seems to be the official name and the name supported by JCE. But now I see webrecorder uses "SHA256", which then fails. Maybe Python uses SHA256 instead of SHA-256?
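For what it's worth, a small sketch of how a reader could bridge the naming differences; the mapping tables here are my own illustrative subset (Python's hashlib uses names like "sha256", while JCE's MessageDigest.getInstance expects "SHA-256"):

```python
import hashlib

# Map normalised digest-header labels to the names the two libraries expect.
HASHLIB_NAMES = {"sha1": "sha1", "sha256": "sha256", "sha512": "sha512"}
JCE_NAMES = {"sha1": "SHA-1", "sha256": "SHA-256", "sha512": "SHA-512"}

def normalise_label(label: str) -> str:
    # 'SHA-256', 'SHA256', 'sha-256' -> 'sha256' (illustrative subset only)
    return label.strip().lower().replace("-", "")

label = "SHA-256"
print(hashlib.new(HASHLIB_NAMES[normalise_label(label)]).name)  # sha256
print(JCE_NAMES[normalise_label(label)])                        # SHA-256 (for JCE)
```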