Open jbenet opened 10 years ago
@dominictarr @substack @maxogden @davidad @ali01 @msparks @sqs @feross @dcposch
what are the scenarios this will get used in? I'm guessing it's something like this:
There is a widely used hash function X. Later, people develop better hash functions, but there will still be plenty of people using X. Later still, weaknesses are discovered in X, and everyone is advised to update to Y.
This is far far fewer hash functions than if you just always use the best hash function around. You probably shouldn't just update your app to the current flavor of the month... this would really be something you might do once in a decade or two.
If you have hash trees, which I'm guessing you would... then you will have to rebuild all your data anyway. Maybe all you need is to pass the current hash function in the handshake or config file, and have tests where you inject a different hash function or allow building with a new hash function.
Re. the UTF-8 question, I think it is so that "continuation" bytes are unambiguous vs. "starting" bytes. This means that e.g. if a stream is chopped off in the middle of a multibyte glyph, the remote will know that it does not have a complete glyph at the start of the stream.
Anyway, @jbenet, when I was thinking about the wire protocol for my "new OSI model", I came to roughly the same conclusion, that we need a prefix field to specify ciphersuite/protocol version. The way I figured it, the cost of a varint is not worth it, since if push comes to shove, a "protocol version" could be standardized that has an additional ciphersuite field with a bigger size. However, I decided on two bytes rather than one (same size as the TLS ciphersuite field, and with IANA's "port numbers" in mind).
BTW, a bonus of ciphersuite prefix field is that an appropriate lookup table can be used to determine the length of the hash, so other content can follow it on the wire with no delimiters.
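A minimal sketch of that lookup-table point, with hypothetical one-byte function codes (illustrative values, not an assigned registry): once the prefix determines the digest length, consecutive `<code><digest>` records can be sliced off a buffer with no delimiters.

```javascript
// Hypothetical code -> digest-length table (values are illustrative).
const DIGEST_LEN = { 0x11: 20, 0x12: 32, 0x40: 64 } // e.g. sha1, sha2-256, blake2b

// Read consecutive <code><digest> records from a buffer, no delimiters needed:
// the prefix alone tells us where each digest ends.
function readHashes (buf) {
  const out = []
  let i = 0
  while (i < buf.length) {
    const code = buf[i]
    const len = DIGEST_LEN[code]
    if (len === undefined) throw new Error('unknown function code: ' + code)
    out.push({ code, digest: buf.slice(i + 1, i + 1 + len) })
    i += 1 + len
  }
  return out
}
```

So a sha1 record followed immediately by a sha2-256 record on the wire parses unambiguously, because the one-byte prefix fixes each record's length.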
I would like to put in a word for making it a true ciphersuite field, rather than a "hash function suite" field. A ciphersuite specifies a hash function, anyway; why invent an entirely new, incompatible numbering scheme for a smaller concept? If you want to use shiny new hashes like BLAKE2b that aren't part of a TLS ciphersuite yet, just use an existing unassigned block like {0xdc, *} and define your own set of ciphersuites there, copying some existing ciphersuite but modifying the hash function. (I recommend copying the ECDHE-ECDSA-ChaCha20-Poly1305 ciphersuite if it's all the same to you.)
@dominictarr
You probably shouldn't just update your app to the current flavor of the month... this would really be something you might do once in a decade or two.
Yeah definitely.
If you have hash trees, which I'm guessing you would... then you well have to rebuild all your data anyway.
Maybe, not always. For example, you might be switching from sha256 to blake2b for the speed improvement, but don't necessarily need to change all the stored data. (And even if you do, in a live p2p system, you'll be upgrading over time). So as I see it, hash functions will need to coexist in systems aiming to have data in use ~10 years. So git + bittorrent definitely qualify :)
@davidad
I think it is so that "continuation" bytes are unambiguous vs. "starting" bytes. This means that e.g. if a stream is chopped off in the middle of a multibyte glyph, the remote will know that it does not have a complete glyph at the start of the stream.
Ahh, makes sense. But seems brittle. Reliability (and message/item atomicity) will still have to be the responsibility of the stream's client. So this probably doesn't need it.
Anyway, @jbenet, when I was thinking about the wire protocol for my "new OSI model", I came to roughly the same conclusion
Yeah, figured you ran into this too :)
the cost of a varint is not worth it, since if push comes to shove, a "protocol version" could be standardized that has an additional ciphersuite field with a bigger size. However, I decided on two bytes rather than one (same size as the TLS ciphersuite field, and with IANA's "port numbers" in mind).
Seems like UTF-16 where the varint cost is enormous. :/
BTW, a bonus of ciphersuite prefix field is that an appropriate lookup table can be used to determine the length of the hash, so other content can follow it on the wire with no delimiters.
Oh yeah! That's great.
I would like to put in a word for making it a true ciphersuite field, rather than a "hash function suite" field. A ciphersuite specifies a hash function, anyway; why invent an entirely new, incompatible numbering scheme for a smaller concept?
Yeah, sgtm.
If you want to use shiny new hashes like BLAKE2b that aren't part of a TLS ciphersuite yet, just use an existing unassigned block like {0xdc, *} and define your own set of ciphersuites there, copying some existing ciphersuite but modifying the hash function.
It's kind of silly that the combinations need to be enumerated. Does it save space in the end, due to the sparsity of valid combinations? Or is it just good old standard IANA procedure? For many use cases, though, one only needs a hash function (say, hashes in git). It would be nice to just use standardized values for each of the components, and allocate a byte per component wanted in the application.
https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-4
Hey look! It's a really slow package manager. Guess the size of the id namespace calls for a really slow publish process. :)
Aaaaaand, here's what we're looking for:
https://www.iana.org/assignments/tls-parameters/tls-parameters.xhtml#tls-parameters-18
Boom. Maybe just use those.
(I recommend copying the ECDHE/ChaCha/Poly1305 ciphersuite if it's all the same to you.)
ECDHE-{ECDSA,RSA}-CHACHA20-POLY1305-BLAKE2?
Btw, @davidad's new OSI (an awesome proposal) is here: http://davidad.github.io/blog/2014/04/24/an-osi-layer-model-for-the-21st-century/
@jbenet Re. UTF-8 vs UTF-16, saving a byte is one thing, but the CPU cost of processing or filtering packets is another, and I think the UTF-8 style varint imposes a heavy penalty on the latter that is not really justified, since there's only going to be one of these objects per packet. (UTF-8 makes sense because text files are entirely full of packed glyphs, among other reasons like ASCII compatibility.) And they really ought to be able to fit in two bytes for pretty much ever (but unlike Y2K, if that assumption turns out to be wrong, it is at least possible to back out of it).
the CPU cost of processing or filtering packets is another
Yeah, that's fair.
since there's only going to be one of these objects per packet.
I plan to use this per-hash. So, say, an ipfs tree object would have many.
And they really ought to be able to fit in two bytes for pretty much ever
Yep. Unless there's a proliferation of hash functions made possible by the emergence of a more open + interoperable way of specifying hash functions. (I can imagine variants/wrappers of others.)
Very neat!
crypt(3) is a similar idea, but password-focused.
@msparks said:
crypt(3) is a similar idea, but password-focused.
You mean the way crypt stores things ($id$salt$encrypted)? If so, this way of storing is like the sha256- str prefix discussed above.
crypt(3) is similar in that it uses ID prefixes, e.g., a $1$ prefix is MD5 and a $5$ prefix is SHA-256. It's dissimilar in that it adds the ID to the ASCII-readable digest. If I read your proposal correctly, you're prepending the ID to the hash output bytes.
(Note: I'm not saying anything novel; just mentioning prior work.)
If I read your proposal correctly, you're prepending the ID to the hash output bytes.
@msparks yep. You can see https://github.com/jbenet/go-multihash/blob/master/multihash.go -- this makes it en/decodable between binary <--> string. (Prepending $5$ to a hex digest is bad because $ is not hex, so converting becomes problematic.)
You can see this technique being used in https://github.com/jbenet/node-ipfs
so what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name - so if you start using multiple hashes it gets weird, because what happens when the same object is referred to by two hashes?
but this morning I had an idea: what if you just tag the object being hashed with the type of hash that should be used - now, this idea is not without problems - but it would mean there is a canonical referent for any object, so you could easily mix sha256 with BLAKE2, and follow refs from one to the other.
something like this:
```js
// automatically create a sha256 hash
var hash = autohash({
  message: "hello", hash: 'sha256'
})

// automatically create a blake2 hash
var hash2 = autohash({
  message: "hello", hash: 'blake2'
})
```
Now, instances could make a local choice about when they wanted to upgrade their hash, and other instances - as long as they know how to compute that hash - can use that. If they do not know that hash, they simply will not be able to process data from the new instances until they upgrade their software.
If you need to refer to an object with a hash that you now consider insecure, you can refer to it by its old hash, and then give a second check hash with a secure algorithm. So, old data could be incrementally secured when it's referenced by new data!
so what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name - so if you start using multiple hashes it gets weird because what happens when the same object is referred to by two hashes?
This isn't true in practice. It's possible to get everything to use the same name, but in practice you end up chunking large files anyway (can't sha256 the whole thing), so now you've got infinite numbers of ways you could hash the same file with the same hash function.
what if you just tag the object being hashed with the type of hash that should be used - now, this idea is not without problems - but it would mean there is a canonical referent for any object, so you could easily mix sha256 with BLAKE2, and follow refs from one to the other.
So putting the multihash function tag into the data too? Would have to also describe how to hash it (whole thing, how to get block boundaries, etc, which just turns into a dag anyway). I like this idea.
If you need to refer to an object with a hash that you now consider insecure, you can refer to it by its old hash, and then give a second check hash with a secure algorithm. So, old data could be incrementally secured when it's referenced by new data!
Yeah I like this.
in practice you end up chunking large files anyway
I'm curious who does this? As a counterexample, you can download the Geocities archive (640GB+) as one torrent, with one info hash (2DC18F47AFEE0307E138DAB3015EE7E5154766F6).
as one torrent, with one info hash (2DC18F47AFEE0307E138DAB3015EE7E5154766F6).
And what does that infohash mean @feross ? What is it a hash of? (spoiler alert: chunks)
I still think that @dominictarr's original point is valid though:
if you start using multiple hashes it gets weird because what happens when the same object is referred to by two hashes?
At least 2DC18F47AFEE0307E138DAB3015EE7E5154766F6 uniquely identifies the tree of folders and files that comprise this particular user's dump of Geocities.
so now you've got infinite numbers of ways you could hash the same file with the same hash function
This could be standardized. Remove silly fields like "title" from the info dictionary (so it's not part of the content that gets hashed). Standardize the piece length selection algorithm (webtorrent uses piece-length btw). And then you're pretty dang close to the promise of one file/folder = one hash.
At least 2DC18F47AFEE0307E138DAB3015EE7E5154766F6 uniquely identifies the tree of folders and files that comprise this particular user's dump of Geocities.
Precisely the point i made :). When you're hashing a massive file, you end up with a hash that uniquely identifies the last layer of the particular chunking/blocking algorithms you chose. Not the final, underlying file itself.
This could be standardized.
How is this different from saying, "in this application, we will hash with hash function X, and chunk files using deterministic function Y"? The IPFS merkledag, and the rabin-fingerprinting file chunking, are examples of such standards.
@dominictarr's point is this:
so what has been bothering me about this idea is that the great thing about content addressing is that everything has only one name - so if you start using multiple hashes it gets weird because what happens when the same object is referred to by two hashes?
My point is this: standards are the only way to address this problem, and the "hash functions go stale" problem.
I don't think that this precludes some sort of deterministic tree hash. maybe it could even have the same result as a non-tree hash for small files.
That said - there is quite a bit more design space in the field of replicating large files. That is something you might want to upgrade separately, and maybe not because of a problem with the crypto, but perhaps just an improvement in the protocol.
So, if a hash points indirectly to some large object that must be replicated over some special protocol, such as bittorrent or a rabin tree or a merkle tree, then perhaps the hash should be tagged with that?
<hash>-bt or <hash>-rabin or <hash>-mrkl ...
in the case of bittorrent the infofile is small, but defines what to get next. With a merkle or rabin tree that may not be the case... basically, treating this like a type of hash? Of course, this is a hash with parameters; you could use different types of hashes within the tree, etc., so maybe you need more than just one number anyway.
or maybe you could always point to a header file which contains metadata such as what parameters the large file uses... of course this assumes that there is a threshold where anything smaller is just a straightforward hash.
These all sound like special cases of the more general "just use a merkle dag". All of these can be implemented on top -- up to the users to decide how to index, but all the indices work everywhere.
That would be correct, but you can eliminate round trips by having more than one replication protocol.
Maybe you have a structure that has some well defined properties, say maybe it's a linked list.
Using a standard merkle dag approach, requesting each hash would take O(N) round trips. But if you could request_since(list_id, count) -- where list_id identifies the list (say, it's the hash(public_key) that can be used to verify the list) and count is just the number of items you already have -- then it would only take O(1) round trips!
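A toy sketch of that request_since idea (the names come from the comment above; the in-memory log store is invented, and real verification against hash(public_key) is omitted):

```javascript
// Toy log replica: O(1) round trips by asking for "everything after what
// I already have" instead of walking hash links one request at a time.
const logs = { 'list-abc': ['m1', 'm2', 'm3', 'm4', 'm5'] } // invented store

function request_since (list_id, count) {
  const log = logs[list_id]
  if (!log) throw new Error('unknown list: ' + list_id)
  return log.slice(count) // one request returns all missing items
}
```

A peer holding 3 items calls `request_since('list-abc', 3)` and gets `['m4', 'm5']` in a single round trip, vs. two hash-by-hash requests with a plain merkle walk.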
If you really want to make this InterPlanetary then round trips becomes a huge problem. Light speed round trip to the moon is 2.6 seconds! To Mars, Venus, the Asteroids? I'll leave that as an exercise for the reader...
There are even places on earth where the round trip time is a source of frustration when people misdesign protocols or applications to unnecessarily round-trip. You should come on a holiday to visit me in New Zealand and experience it first hand!
oh hang on, did I just advocate cypherspace urls now?
If you really want to make this InterPlanetary then round trips becomes a huge problem.
Yeah, absolutely. I'd use a routing system that's aware of this (i.e. make requests across high latency rarely). (technically coral is already latency-cluster aware, but planetary scale is a completely different beast)
did I just advocate cypherspace urls now?
This is what /ipfs/<hash>/<path> and /ipns/<hash>/<path> are all about. See also https://gist.github.com/jbenet/ca4f31dfbaec7c8ce9a8 / #28
(the first path component is a protocol. one of my goals in life is to move us away from the "scheme identifier". It breaks links + mounting in these links.)
Okay so how do you plan to "make requests rarely"? I thought you just had a want-list?
Having obsessively studied data-replication for a few years now, I just don't believe there is a golden hammer that solves the problem efficiently in general.
Okay so how do you plan to "make requests rarely"? I thought you just had a want-list?
No i mean the routing system should be aware of this. One extreme solution is you have 2 planetary dhts -- if you're in mars you don't use Earth's -- and use something else latency aware to route requests + fetch things back across interplanetary links. A better solution bakes all of this into itself with latency measurement + ways to express variable blocks (i.e. as you've suggested, retrieving path globs :+1: )
so if you are gonna have path globs, why not have multiple protocols? maybe the compromise is to be able to fall back to requesting blobs from a want list directly?
the replication protocol in scuttlebutt is certainly "a way to express variable blocks", and it's intentionally simple. Something like https://github.com/jbenet/ipfs/issues/8 is probably possible, but it's an area that has not been explored yet, so I don't think we can simply hold it up and say it's a solved problem. There are other classes of structure that can be replicated efficiently, but again, these do not solve replication for general-purpose structures.
Yes, encoding the number of continuation bytes in unary (base-1), followed by a zero bit, followed by enough data bits to finish the byte, followed by the continuation bytes (in network (big-endian) byte order), makes a lot of sense. (In other words, like UTF-8, but without making all of the continuation bytes 10xxxxxx.) Inside Google, these are known as prefix varints. This (1) keeps values in lexicographic order (ignoring the zigzag encoding that Google uses to move the sign bit to the least significant bit) and (2) is faster to decode and encode than the base-128 varint scheme used by Protocol Buffers and LLVM bitcode. I think if Google had it to do over, they'd use prefix varints in Protocol Buffers.
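A sketch of that prefix-varint scheme, handling 1- to 3-byte widths for brevity (function names are mine): the continuation-byte count lives in unary at the top of the first byte, the data bits follow big-endian, and byte-wise comparison preserves numeric order.

```javascript
// Prefix varint: unary continuation count, a 0 bit, then big-endian data bits.
// 0xxxxxxx = 7 bits, 10xxxxxx + 1 byte = 14 bits, 110xxxxx + 2 bytes = 21 bits.
function encodePrefixVarint (n) {
  if (n < 0x80) return Buffer.from([n])
  if (n < 0x4000) return Buffer.from([0x80 | (n >> 8), n & 0xff])
  if (n < 0x200000)
    return Buffer.from([0xc0 | (n >> 16), (n >> 8) & 0xff, n & 0xff])
  throw new Error('value too large for this sketch')
}

// Returns [value, bytesConsumed]; the first byte alone gives the width.
function decodePrefixVarint (buf) {
  const b = buf[0]
  if ((b & 0x80) === 0) return [b, 1]
  if ((b & 0xc0) === 0x80) return [((b & 0x3f) << 8) | buf[1], 2]
  if ((b & 0xe0) === 0xc0)
    return [((b & 0x1f) << 16) | (buf[1] << 8) | buf[2], 3]
  throw new Error('width not handled in this sketch')
}
```

Unlike base-128 varints, the decoder learns the total width from the first byte, with no per-byte continuation-bit loop.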
In other words, like UTF-8, but without making all of the continuation bytes 10xxxxxx
yeah +1
This (1) keeps values in lexicographic order (ignoring the zigzag encoding that Google uses to move the sign bit to the least significant bit) and (2) is faster to decode and encode than the base-128 varint scheme used by Protocol Buffers
very much +1
I think if Google had it to do over, they'd use prefix varints in Protocol Buffers.
wish the various authors published lessons learned / recommendations for the future.
btw, jeff dean had a nice discussion on very large varints here: http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/WSDM09-keynote.pdf
Problem
As time passes, software that uses a particular hash function will often need to upgrade to a better, faster, stronger... one. This introduces large costs: systems may assume a particular hash size, or call sha1 all over the place. It's already common to see hashes prefixed with a function id:
Is this the best way? Maybe it is. But there are some problems: the length of a string prefix like blake2b- may matter. So we might want to use a much narrower prefix, particularly given that "widely used and accepted secure cryptographic hash functions" tend to change very little over time (by 2014 there are fewer than 256 that you might seriously consider). Is there an RFC for this? I haven't found a "Hash Function Suite" like the "Cipher Suite" in TLS (RFC 5246/A.5).
Potential solutions
Use a short prefix mapping to some "cryptographic hash function" suite. This already has to be done: the sha1- prefix is more human-readable, but it's probably not a good idea to blindly dispatch a function based on the string sha1. Whitelisting specific strings (a blessed table) already happens. So what would this look like? For example, suppose sha1 is 0x01.
Pros:
Cons:
0x01 is sha1, 0x02 is sha256, etc.

on varints
Ideally, for proper future-proofing, we want a varint. Though it is to be noted that varints are annoying to parse + slower than fixed-width ints. There are so few "widely used... hash functions" that it may be okay to get away with one byte. Luckily, we can wait until we reach 127 functions before we have to decide which varint scheme to use :)
May be able to repurpose utf-8 implementations for this.
Random UTF-8 question: why do the subsequent bytes waste two bits each? (The '10' prefix below.)
From http://en.wikipedia.org/wiki/UTF-8#Description
Is it to keep the code point ranges nice and rounded-ish?
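One answer (the self-synchronization point from earlier in the thread): because every continuation byte looks like 10xxxxxx, a reader dropped mid-glyph can resynchronize by skipping to the next non-continuation byte. A small sketch (helper name is mine):

```javascript
// UTF-8 self-synchronization: continuation bytes are all 10xxxxxx, so a
// reader joining a stream mid-glyph can skip forward to the next leading byte.
function firstGlyphStart (bytes) {
  let i = 0
  while (i < bytes.length && (bytes[i] & 0xc0) === 0x80) i++ // skip 10xxxxxx
  return i
}

const snowman = Buffer.from('☃x', 'utf8') // bytes: e2 98 83 78
const chopped = snowman.slice(1)          // stream cut mid-glyph: 98 83 78
// firstGlyphStart(chopped) points at the 'x', past the orphaned continuations
```

Without those two reserved bits per byte, a byte mid-glyph could be mistaken for the start of a new character, and this resync would be impossible.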