Open jbenet opened 10 years ago
@dominictarr @substack @maxogden @davidad @ali01 @msparks @sqs @feross @dcposch
what are the scenarios this will get used in? I'm guessing it's something like this:
There is a widely used hash function X. Later, people develop better hash functions, but there will still be plenty of people using X. Later still, weaknesses are discovered in X, and everyone is advised to update to Y.
This is far far fewer hash functions than if you just always use the best hash function around. You probably shouldn't just update your app to the current flavor of the month... this would really be something you might do once in a decade or two.
If you have hash trees, which I'm guessing you would... then you will have to rebuild all your data anyway. Maybe all you need is to pass the current hash function in the handshake or config file, and have tests where you inject a different hash function or allow building with a new hash function.
Re. the UTF-8 question, I think it is so that "continuation" bytes are unambiguous vs. "starting" bytes. This means that e.g. if a stream is chopped off in the middle of a multibyte glyph, the remote will know that it does not have a complete glyph at the start of the stream.
Anyway, @jbenet, when I was thinking about the wire protocol for my "new OSI model", I came to roughly the same conclusion, that we need a prefix field to specify ciphersuite/protocol version. The way I figured it, the cost of a varint is not worth it, since if push comes to shove, a "protocol version" could be standardized that has an additional ciphersuite field with a bigger size. However, I decided on two bytes rather than one (same size as the TLS ciphersuite field, and with IANA's "port numbers" in mind).
BTW, a bonus of ciphersuite prefix field is that an appropriate lookup table can be used to determine the length of the hash, so other content can follow it on the wire with no delimiters.
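A minimal sketch of that lookup-table point, with hypothetical one-byte function codes (illustrative values, not an assigned registry): once the prefix determines the digest length, consecutive `<code><digest>` records can be sliced off a buffer with no delimiters.

```javascript
// Hypothetical code -> digest-length table (values are illustrative).
const DIGEST_LEN = { 0x11: 20, 0x12: 32, 0x40: 64 } // e.g. sha1, sha2-256, blake2b

// Read consecutive <code><digest> records from a buffer, no delimiters needed:
// the prefix alone tells us where each digest ends.
function readHashes (buf) {
  const out = []
  let i = 0
  while (i < buf.length) {
    const code = buf[i]
    const len = DIGEST_LEN[code]
    if (len === undefined) throw new Error('unknown function code: ' + code)
    out.push({ code, digest: buf.slice(i + 1, i + 1 + len) })
    i += 1 + len
  }
  return out
}
```

So a sha1 record followed immediately by a sha2-256 record on the wire parses unambiguously, because the one-byte prefix fixes each record's length.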
I would like to put in a word for making it a true ciphersuite field, rather than a "hash function suite" field. A ciphersuite specifies a hash function, anyway; why invent an entirely new, incompatible numbering scheme for a smaller concept? If you want to use shiny new hashes like BLAKE2b that aren't part of a TLS ciphersuite yet, just use an existing unassigned block like {0xdc, *} and define your own set of ciphersuites there, copying some existing ciphersuite but modifying the hash function. (I recommend copying the ECDHE-ECDSA-ChaCha20-Poly1305 ciphersuite if it's all the same to you.)
@dominictarr
You probably shouldn't just update your app to the current flavor of the month... this would really be something you might do once in a decade or two.
Yeah definitely.
If you have hash trees, which I'm guessing you would... then you well have to rebuild all your data anyway.
Maybe, not always. For example, you might be switching from sha256 to blake2b for the speed improvement, but don't necessarily need to change all the stored data. (And even if you do, in a live p2p system, you'll be upgrading over time). So as I see it, hash functions will need to coexist in systems aiming to have data in use ~10 years. So git + bittorrent definitely qualify :)
@davidad
I think it is so that "continuation" bytes are unambiguous vs. "starting" bytes. This means that e.g. if a stream is chopped off in the middle of a multibyte glyph, the remote will know that it does not have a complete glyph at the start of the stream.
Ahh, makes sense. But seems brittle. Reliability (and message/item atomicity) will still have to be the responsibility of the stream's client. So this probably doesn't need it.
Anyway, @jbenet, when I was thinking about the wire protocol for my "new OSI model", I came to roughly the same conclusion
Yeah, figured you ran into this too :)
the cost of a varint is not worth it, since if push comes to shove, a "protocol version" could be standardized that has an additional ciphersuite field with a bigger size. However, I decided on two bytes rather than one (same size as the TLS ciphersuite field, and with IANA's "port numbers" in mind).
Seems like UTF-16 where the varint cost is enormous. :/
BTW, a bonus of ciphersuite prefix field is that an appropriate lookup table can be used to determine the length of the hash, so other content can follow it on the wire with no delimiters.
Oh yeah! That's great.
I would like to put in a word for making it a true ciphersuite field, rather than a "hash function suite" field. A ciphersuite specifies a hash function, anyway; why invent an entirely new, incompatible numbering scheme for a smaller concept?
Yeah, sgtm.
If you want to use shiny new hashes like BLAKE2b that aren't part of a TLS ciphersuite yet, just use an existing unassigned block like {0xdc, *} and define your own set of ciphersuites there, copying some existing ciphersuite but modifying the hash function.
It's kind of silly that the combinations need to be enumerated. Does it save space in the end, due to the sparsity of valid combinations? Or is it just good old standard IANA procedure? For many use cases, though, one only needs a hash function (say, hashes in git). It would be nice to just use standardized values for each of the components, and allocate a byte per component wanted in the application.
https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-4
Hey look! It's a really slow package manager. Guess the size of the id namespace calls for a really slow publish process. :)
Aaaaaand, here's what we're looking for:
https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18
Boom. Maybe just use those.
(I recommend copying the ECDHE/ChaCha/Poly1305 ciphersuite if it's all the same to you.)
ECDHE-{ECDSA,RSA}-CHACHA20-POLY1305-BLAKE2?
Btw, @davidad's new OSI (an awesome proposal) is here: http://davidad.github.io/blog/2014/04/24/an-osi-layer-model-for-the-21st-century/
@jbenet Re. UTF-8 vs UTF-16, saving a byte is one thing, but the CPU cost of processing or filtering packets is another, and I think the UTF-8 style varint imposes a heavy penalty on the latter that is not really justified, since there's only going to be one of these objects per packet. (UTF-8 makes sense because text files are entirely full of packed glyphs, among other reasons like ASCII compatibility.) And they really ought to be able to fit in two bytes for pretty much ever (but unlike Y2K, if that assumption turns out to be wrong, it is at least possible to back out of it).
the CPU cost of processing or filtering packets is another
Yeah, that's fair.
since there's only going to be one of these objects per packet.
I plan to use this per-hash. So, say, an ipfs tree object would have many.
And they really ought to be able to fit in two bytes for pretty much ever
Yep. Unless there's a proliferation of hash functions made possible by the emergence of a more open + interoperable way of specifying hash functions. (I can imagine variants/wrappers of others.)
Very neat!
crypt(3) is a similar idea, but password-focused.
@msparks said:
crypt(3) is a similar idea, but password-focused.
You mean the way crypt stores things ($id$salt$encrypted)? If so, this way of storing is like the sha256- str prefix discussed above.
crypt(3) is similar in that it uses ID prefixes, e.g., a $1$ prefix is MD5 and a $5$ prefix is SHA-256. It's dissimilar in that it adds the ID to the ASCII-readable digest. If I read your proposal correctly, you're prepending the ID to the hash output bytes.
(Note: I'm not saying anything novel; just mentioning prior work.)
If I read your proposal correctly, you're prepending the ID to the hash output bytes.
@msparks yep. You can see https://github.com/jbenet/go-multihash/blob/master/multihash.go -- this makes it en/decodable between binary <--> string. (Prepending $5$ to a hex digest is bad because $ is not hex, so converting becomes problematic.)
You can see this technique being used in https://github.com/jbenet/node-ipfs
so what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name - so if you start using multiple hashes it gets weird, because what happens when the same object is referred to by two hashes?
but this morning I had an idea: what if you just tag the object being hashed with the type of hash that should be used - now, this idea is not without problems - but it would mean there is a canonical referent for any object, so you could easily mix sha256 with BLAKE2, and follow refs from one to the other.
something like this:
```js
// automatically create a sha256 hash
var hash = autohash({
  message: "hello", hash: 'sha256'
})

// automatically create a blake2 hash
var hash2 = autohash({
  message: "hello", hash: 'blake2'
})
```
Now, instances could make a local choice about when they wanted to upgrade their hash, and other instances - as long as they know how to compute that hash - can use that. If they do not know that hash, they simply will not be able to process data from the new instances until they upgrade their software.
If you need to refer to an object with a hash that you now consider insecure, you can refer to it by its old hash, and then give a second check hash with a secure algorithm. So, old data could be incrementally secured when it's referenced by new data!
so what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name - so if you start using multiple hashes it gets weird because what happens when the same object is referred to by two hashes?
This isn't true in practice. It's possible to get everything to use the same name, but in practice you end up chunking large files anyway (can't sha256 the whole thing), so now you've got infinite numbers of ways you could hash the same file with the same hash function.
what if you just tag the object being hashed with the type of hash that should be used - now, this idea is not without problems - but it would mean there is a canonical referent for any object, so you could easily mix sha256 with BLAKE2, and follow refs from one to the other.
So putting the multihash function tag into the data too? Would have to also describe how to hash it (whole thing, how to get block boundaries, etc, which just turns into a dag anyway). I like this idea.
If you need to refer to an object with a hash that you now consider insecure, you can refer to it by its old hash, and then give a second check hash with a secure algorithm. So, old data could be incrementally secured when it's referenced by new data!
Yeah I like this.
in practice you end up chunking large files anyway
I'm curious who does this? As a counterexample, you can download the Geocities archive (640GB+) as one torrent, with one info hash (2DC18F47AFEE0307E138DAB3015EE7E5154766F6).
as one torrent, with one info hash (2DC18F47AFEE0307E138DAB3015EE7E5154766F6).
And what does that infohash mean @feross ? What is it a hash of? (spoiler alert: chunks)
I still think that @dominictarr's original point is valid though:
if you start using multiple hashes it gets weird because what happens when the same object is referred to by two hashes?
At least 2DC18F47AFEE0307E138DAB3015EE7E5154766F6 uniquely identifies the tree of folders and files that comprise this particular user's dump of Geocities.
so now you've got infinite numbers of ways you could hash the same file with the same hash function
This could be standardized. Remove silly fields like "title" from the info dictionary (so it's not part of the content that gets hashed). Standardize the piece length selection algorithm (webtorrent uses piece-length btw). And then you're pretty dang close to the promise of one file/folder = one hash.
At least 2DC18F47AFEE0307E138DAB3015EE7E5154766F6 uniquely identifies the tree of folders and files that comprise this particular user's dump of Geocities.
Precisely the point i made :). When you're hashing a massive file, you end up with a hash that uniquely identifies the last layer of the particular chunking/blocking algorithms you chose. Not the final, underlying file itself.
This could be standardized.
How is this different from saying, "in this application, we will hash with hash function X, and chunk files using deterministic function Y"? The IPFS merkledag, and the rabin-fingerprinting file chunking, are examples of such standards.
@dominictarr's point is this:
so what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name - so if you start using multiple hashes it gets weird because what happens when the same object is referred to by two hashes?
My point is this: standards are the only way to address this problem, and the "hash functions go stale" problem.
I don't think that this precludes some sort of deterministic tree hash. maybe it could even have the same result as a non-tree hash for small files.
That said - there is quite a bit more design space in the field of replicating large files. That is something you might want to upgrade separately, and maybe not because of a problem with the crypto, but perhaps just an improvement in the protocol.
So, if a hash points indirectly to some large object that must be replicated over some special protocol, such as bittorrent or a rabin tree or a merkle tree, then perhaps the hash should be tagged with that?
<hash>-bt or <hash>-rabin or <hash>-mrkl ...
in the case of bittorrent the infofile is small, but defines what to get next. With a merkle or rabin tree that may not be the case... basically, treating this like a type of hash? Of course, this is a hash with parameters; you could use different types of hashes within the tree, etc., so maybe you need more than just one number anyway.
or maybe you could always point to a header file which contains metadata such as what parameters the large file uses... of course this assumes that there is a threshold where anything smaller is just a straightforward hash.
These all sound like special cases of the more general "just use a merkle dag". All of these can be implemented on top -- up to the users to decide how to index, but all the indices work everywhere.
That would be correct, but you can eliminate round trips by having more than one replication protocol.
Maybe you have a structure that has some well defined properties, say maybe it's a linked list.
Using a standard merkle dag approach, requesting each hash would take O(N) round trips. But if you could request_since(list_id, count) -- where list_id identifies the list (say, it's the hash(public_key) that can be used to verify the list) and count is just the number of items you already have -- then it would only take O(1) round trips!
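A toy sketch of that request_since idea (the names come from the comment above; the in-memory log store is invented, and real verification against hash(public_key) is omitted):

```javascript
// Toy log replica: O(1) round trips by asking for "everything after what
// I already have" instead of walking hash links one request at a time.
const logs = { 'list-abc': ['m1', 'm2', 'm3', 'm4', 'm5'] } // invented store

function request_since (list_id, count) {
  const log = logs[list_id]
  if (!log) throw new Error('unknown list: ' + list_id)
  return log.slice(count) // one request returns all missing items
}
```

A peer holding 3 items calls `request_since('list-abc', 3)` and gets `['m4', 'm5']` in a single round trip, vs. two hash-by-hash requests with a plain merkle walk.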
If you really want to make this InterPlanetary then round trips becomes a huge problem. Light speed round trip to the moon is 2.6 seconds! To Mars, Venus, the Asteroids? I'll leave that as an exercise for the reader...
There are even places on earth where the round trip time is a source of frustration when people misdesign protocols or applications to unnecessarily round-trip. You should come on a holiday to visit me in New Zealand and experience it first hand!
oh hang on, did I just advocate cypherspace urls now?
If you really want to make this InterPlanetary then round trips becomes a huge problem.
Yeah, absolutely. I'd use a routing system that's aware of this (i.e. make requests across high latency rarely). (technically coral is already latency-cluster aware, but planetary scale is a completely different beast)
did I just advocate cypherspace urls now?
This is what /ipfs/<hash>/<path> and /ipns/<hash>/<path> are all about. See also https://gist.github.com/jbenet/ca4f31dfbaec7c8ce9a8 / #28
(the first path component is a protocol. one of my goals in life is to move us away from the "scheme identifier". It breaks links + mounting in these links.)
Okay so how do you plan to "make requests rarely"? I thought you just had a want-list?
Having obsessively studied data-replication for a few years now, I just don't believe there is a golden hammer that solves the problem efficiently in general.
Okay so how do you plan to "make requests rarely"? I thought you just had a want-list?
No i mean the routing system should be aware of this. One extreme solution is you have 2 planetary dhts -- if you're in mars you don't use Earth's -- and use something else latency aware to route requests + fetch things back across interplanetary links. A better solution bakes all of this into itself with latency measurement + ways to express variable blocks (i.e. as you've suggested, retrieving path globs :+1: )
so if you are gonna have path globs, why not have multiple protocols? maybe the compromise is to be able to fall back to requesting blobs from a want list directly?
the replication protocol in scuttlebutt is certainly "a way to express variable blocks", and it's intentionally simple. Something like https://github.com/jbenet/ipfs/issues/8 is probably possible, but it's an area that has not been explored yet, so I don't think we can simply hold it up and say it's a solved problem. There are other classes of structure that can be replicated efficiently, but again, these do not solve replication for general-purpose structures.
Yes, encoding the number of continuation bytes in unary (base-1), followed by a zero bit, followed by enough data bits to finish the byte, followed by the continuation bytes (in network (big-endian) byte order), makes a lot of sense. (In other words, like UTF-8, but without making all of the continuation bytes 10xxxxxx.) Inside Google, these are known as prefix varints. This (1) keeps values in lexicographic order (ignoring the zigzag encoding that Google uses to move the sign bit to the least significant bit) and (2) is faster to decode and encode than the base-128 varint scheme used by Protocol Buffers and LLVM bitcode. I think if Google had it to do over, they'd use prefix varints in Protocol Buffers.
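A sketch of that prefix-varint scheme, handling 1- to 3-byte widths for brevity (function names are mine): the continuation-byte count lives in unary at the top of the first byte, the data bits follow big-endian, and byte-wise comparison preserves numeric order.

```javascript
// Prefix varint: unary continuation count, a 0 bit, then big-endian data bits.
// 0xxxxxxx = 7 bits, 10xxxxxx + 1 byte = 14 bits, 110xxxxx + 2 bytes = 21 bits.
function encodePrefixVarint (n) {
  if (n < 0x80) return Buffer.from([n])
  if (n < 0x4000) return Buffer.from([0x80 | (n >> 8), n & 0xff])
  if (n < 0x200000)
    return Buffer.from([0xc0 | (n >> 16), (n >> 8) & 0xff, n & 0xff])
  throw new Error('value too large for this sketch')
}

// Returns [value, bytesConsumed]; the first byte alone gives the width.
function decodePrefixVarint (buf) {
  const b = buf[0]
  if ((b & 0x80) === 0) return [b, 1]
  if ((b & 0xc0) === 0x80) return [((b & 0x3f) << 8) | buf[1], 2]
  if ((b & 0xe0) === 0xc0)
    return [((b & 0x1f) << 16) | (buf[1] << 8) | buf[2], 3]
  throw new Error('width not handled in this sketch')
}
```

Unlike base-128 varints, the decoder learns the total width from the first byte, with no per-byte continuation-bit loop.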
In other words, like UTF-8, but without making all of the continuation bytes 10xxxxxx
yeah +1
This (1) keeps values in lexicographic order (ignoring the zigzag encoding that Google uses to move the sign bit to the least significant bit) and (2) is faster to decode and encode than the base-128 varint scheme used by Protocol Buffers
very much +1
I think if Google had it to do over, they'd use prefix varints in Protocol Buffers.
wish the various authors published lessons learned / recommendations for the future.
btw, jeff dean had a nice discussion on very large varints here: http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
Problem
As time passes, software that uses a particular hash function will often need to upgrade to a better, faster, stronger... one. This introduces large costs: systems may assume a particular hash size, or call sha1 all over the place. It's already common to see hashes prefixed with a function id:
Is this the best way? Maybe it is. But there are some problems: the length of a string prefix like blake2b- may matter. So we might want to use a much narrower prefix, particularly given that "widely used and accepted secure cryptographic hash functions" tend to change very little over time (by 2014 there are fewer than 256 that you might seriously consider). Is there an RFC for this? I haven't found a "Hash Function Suite" like the "Cipher Suite" in TLS (RFC 5246/A.5).
Potential solutions
Use a short prefix mapping to some "cryptographic hash function" suite. This already has to be done: the sha1- prefix is more human-readable, but it's probably not a good idea to blindly dispatch a function based on the string sha1. Whitelisting specific strings (a blessed table) already happens. So what would this look like? For example, suppose sha1 is 0x01.
Pros:
Cons:
0x01 is sha1, 0x02 is sha256, etc.

on varints
Ideally, for proper future-proofing, we want a varint. Though it is to be noted that varints are annoying to parse + slower than fixed-width ints. There are so few "widely used... hash functions" that it may be okay to get away with one byte. Luckily, we can wait until we reach 127 functions before we have to decide which varint scheme to use :)
May be able to repurpose utf-8 implementations for this.
Random UTF-8 question: why do the subsequent bytes waste two bits each? (The '10' prefix below.)
From http://en.wikipedia.org/wiki/UTF-8#Description
Is it to keep the code point ranges nice and rounded-ish?
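One answer (the self-synchronization point from earlier in the thread): because every continuation byte looks like 10xxxxxx, a reader dropped mid-glyph can resynchronize by skipping to the next non-continuation byte. A small sketch (helper name is mine):

```javascript
// UTF-8 self-synchronization: continuation bytes are all 10xxxxxx, so a
// reader joining a stream mid-glyph can skip forward to the next leading byte.
function firstGlyphStart (bytes) {
  let i = 0
  while (i < bytes.length && (bytes[i] & 0xc0) === 0x80) i++ // skip 10xxxxxx
  return i
}

const snowman = Buffer.from('☃x', 'utf8') // bytes: e2 98 83 78
const chopped = snowman.slice(1)          // stream cut mid-glyph: 98 83 78
// firstGlyphStart(chopped) points at the 'x', past the orphaned continuations
```

Without those two reserved bits per byte, a byte mid-glyph could be mistaken for the start of a new character, and this resync would be impossible.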