Twitter thread with @zooko

@zooko and I are having a conversation on Twitter that's kind of hard to follow. I'm going to try to record it here. This is also a good place to link to other issues that get created based on this convo.

https://twitter.com/oconnor663/status/1058590807207395328

@oconnor663: @zooko I've been working on a tree hash based on BLAKE2b, and it's at the point where it needs a review from a Real Cryptographer. Do you know anyone who might be interested in collaborating on something like that?

@zooko: This is beautiful! https://github.com/oconnor663/bao/blob/master/docs/spec.md … 😍 Thank you for doing this! The spec/rationale seems pretty compete. (Caveat lector: I didn't really try to think very hard about dangers here, and I got only 3 hours of sleep last night, and I ought to be heading for the airport but instead I'm drinking more coffee and reading your spec.) My thoughts:

Maybe someone [*] should update the BLAKE2 spec's tree mode to reflect the way Bao does it. Bao is the only tree mode implementation I'm aware of. So basically, make Bao be the official tree mode for BLAKE2.
The thinnest part of the rationale was “why not 4-arity?”. The answers are (a) +50% memory requirements, (b) more complicated partial subtree merging and offset calcs, (c) it probably wouldn't be that much more efficient. As for 2.a, how many implementations are going to be okay with 2000 bytes RAM but not 2832 b RAM? 2.b: okay, fair point, but 2.c: How much more efficient! I guess in theory it should make little difference, but could you measure it without going to all effort of implementing 2.b?

And, have you considered BLAKE2s instead of BLAKE2b! Should be more CPU-efficient for the parent nodes, slightly more RAM-efficient, and you could try update8. Dunno if it would be a net win on CPU efficiency on desktop. Using BLAKE2s (at least in the parent nodes) takes advantage of a CPU efficiency optimisation that is provided by your choice of arity-2 and that you're otherwise currently leaving on the table. (And it would certainly be a good CPU efficiency win on 32-bit embedded implementations.)

Oh, before I arrive at the airport: the memoization of identical leafs/nodes actually kind of makes me nervous. I'm not sure what assumptions people will make about the lack of information leakage about the data that they incur if they share information about the nodes. A mismatch in the user's assumptions about that kind of thing vs the cryptographic guarantees is what underlies CRIME/BEAST-style attacks and "oops I didn't understand length-extension" attacks. It would be simpler and safer to not do that optimisation, and perhaps nobody will ever really miss it. All data-dependent optimisation (incl memoization and compression) is a “side-channel”.

Okay one last thing before I get to the security line: I'd suggest forbid trailing garbage and require 0's. Itsy bitsy added costs, and one fewer caveat that you have to document and that could harm someone who misunderstood it.

https://twitter.com/zooko/status/1060216680511619072

@zooko: Hey did you ever rescue this thread from twitter threading fails and post it somewhere? I want to know:

Did you benchmark it with BLAKE2s?
Can you get someone to benchmark it on a 32-bit ARM device?
What are you going to do about the "I decoded up to the end, but now the next byte contains garbage?" issue? I think you should require the input is length-delimited to the exact right length. (Granted, it isn't obvious how to do that. But possible.)
What do you think of: "I think you should define Bao in such a way that it is impossible for an implementation to use this optimisation. I think it could be dangerous under certain uses, and Bao is intended to be a general-purpose tool for many currently-unknown uses."

@oconnor663: I spent the last couple days playing with https://github.com/oconnor663/bao_experiments, to test the performance ideas you had. I just put up the results in the README there. The summary seems to be that BLAKE2s doesn't help, a 4-ary tree does help, but larger chunks help even more. Of course all of that is contingent on me not having made an implementation mistake. Those benchmarks have very little testing and no external test vectors 😅

Regarding the choice of chunk size, I have a lot of disorganized thoughts here: https://github.com/oconnor663/bao/issues/17 …. I wonder if there's ever been a similar project that navigated this choice, where we could just reuse their rationale? I'd really like to hear your thoughts there.

I haven't gathered the thread yet, but I'll probably end up making a collection of GitHub issues out of it and tagging you guys there. I'm not sure how to get my hands on ARM benchmarks. What's the easiest thing to do there? Maybe order up a Raspberry Pi? Might be useful to have around anyway :)

I have trouble coming up with a scenario where the trailing-garbage issue matters. The victim would need to both 1) compare encoded files byte-for-byte instead of by hash, and 2) somehow hit a security issue with "different" files producing the same output. Thoughts?

I thought the sparse disk hashing use case was kind of cool, but I'm not wedded to it. Do you think e.g. the Sakura coding should have included a "countermeasure" for it? It doesn't seem like the something a lazy implementer would do, which is usually what I worry about.

https://twitter.com/zooko/status/1060342872946495488

@zooko: Hm... I give you a file, you view it with a viewer and decide to sign it. You feed it to a bao hasher which outputs a hash. You sign the hash. I then give a different file to Bob, he feeds it to his bao hasher, checks that you signed the hash of it, then views it in his viewer.

@oconnor663: If you could do that it would be a hash collision, which would indeed be totally broken. But I don't think you can do that. The trailing garbage issue only has to do with encoding/decoding, not hashing. It's a mutation of the encoding, but not of the encoded contents.

@zooko: Let's say there's no hash collision. I'm imagining that the bao-hasher program is disregarding some "trailing garbage" and the viewer program isn't. To prevent that, we have to guarantee that both programs determine the length or termination of the data identically to each other.

@oconnor663: It's possible I'm misunderstanding the scenario, but I think if we're talking about a hasher program, there's no such thing as trailing garbage. The hasher (as opposed to the decoder) doesn't do anything resembling parsing, it just reads its stream of input bytes. If you add "trailing garbage" to the input you're feeding to a hasher, that's really the same as adding "more input". It's a different file now, and you'll get a different hash.

If you have a decoder program (where you give it an encoded stream somehow, and a hash to validate it), that's when parsing comes into the picture. That program would ignore trailing garbage in an encoding. But that said, you can't give a Bao encoding to your PDF viewer. It would look like random bytes intermixed with a PDF, and the PDF viewer would presumably barf and fail to parse it. I suppose its possible some very forgiving PDF viewer could manage to render something, but that's kinda like opening a JPEG in a PDF viewer and getting lucky. I don't expect most users will ever encounter a Bao encoded file the way they would a ZIP file or a tarball. It's the kind of thing an application might use under the hood in its network protocols, but probably not something you'd build a GUI app to decode.

For clarification here (not on Twitter), here's an example of "trailing garbage":

$ echo hi | bao encode - hi.bao
$ HASH=`echo hi | bao hash`
$ bao decode $HASH hi.bao
hi
$ echo some garbage bytes >> hi.bao
$ bao decode $HASH hi.bao
hi

What's happened here is that some garbage bytes got appended to the encoded file, but it didn't affect decoding. The decoder knows how long the encoding is supposed to be from its header, and it never even reads the trailing garbage. Thus two "different looking" encoded files could successfully decode to the same content.

The big question is whether there could ever be an attack exploiting this fact. Somehow an application would need to assume "different encoding implies different content" and then get surprised when that invariant fails to hold. So far I can't think of a practical reason an application would compare two different encoded files in the first place, but then again those are famous last words in crypto. I have a few old thoughts about this at https://github.com/oconnor663/bao/issues/2.

https://twitter.com/zooko/status/1060355276820414465 regarding https://github.com/oconnor663/bao_experiments

@zooko: Two curious things about this: 1. your blake2sp-flavored thing is slower than your blake2bp-flavored thing when @sevenps's blake2sp is faster than his blake2bp on the same machine. 2. bao-blake2hybrid is slower than bao-blake2b. That's really curious. Aren't the inputs to the nodes always 64 bytes? Isn't blake2s on 64 bytes faster than blake2b on 64 bytes?

@oconnor663: Yes, both were surprising. It's possible I made a mistake. Or maybe it's some weird effect where switching between BLAKE2b and BLAKE2s hurts the instruction cache or something? My implementations inline all their rounds, so they might be excessively large. Just guessing here. I'm also paying some overhead marshalling vector values back into unaligned scalars after each compression call. (This keeps the state structs portable, but we could totally change it.) That could be penalizing BLAKE2s more than BLAKE2b, not sure. Another factor here is that the single-instance implementation of BLAKE2b is using AVX2, but the single-instance implementation of BLAKE2s is portable. I need to pull in the single-instance SSE implementation of BLAKE2s, and I haven't yet.

@zooko: Yes, I imagine the marshalling/unmarshalling could make a difference and you should measure without it. [Regarding the missing SSE implementation] Ah, okay mystery solved. 🙂

@sneves: blake2sp is essentially the same speed as blake2bp but is more sensitive to compiler codegen quirks, so depending on compiler version/flags it is often slower.

https://twitter.com/oconnor663/status/1069645936069132289

@zooko: Like I previously mentioned, there were two details in the design that I didn't feel entirely comfortable with. Both of them are of the flavor that your specification should take away options from the implementor.

The first was the concept of "sparse files" with empty/null/all-zero blocks that could be handled in an optimised way. The second one I still haven't really understood, but it was the concept of "trailing garbage" in one of the encodings that would be ignored when processing. In my humble opinion, you should define Bao so that an implementor can't optimise the null blocks of a sparse file and still get the correct Bao hash output, and—to the degree possible—that the trailing garbage has to be all zeroes or else it is treated as invalid.

Actually I'm really not sure that latter one makes sense. Maybe those trailing bytes aren't actually processed by Bao per se anyway, so the Bao spec can't effectively constrain them. Anyway, moving right along.

My next thought is for wide acceptance/re-use/standardization (not for security per se): you should not allow options. For example, with regard to the block size, I think you should pick one and say that anything which doesn't use the same one isn't "Bao". I think this is a mistake we made with BLAKE2, that we defined 4 different versions (5 if you count the tree mode). I honestly think BLAKE2 would be more widely used today if we had defined only one of those four as being "BLAKE2".

Last, I suggest that you consider petitioning the other BLAKE2 authors (especially @veorq, who is an author of both BLAKE and BLAKE2) for them to bless Bao, once it is finished, as being "BLAKE2T". That way, it would inherit some of the positive reputation of BLAKE2 and it would get more widely trusted and more widely adopted.

One more detail is that IMHO you should at least partially tend toward the "minimax" strategy: go for substantial optimisations of the "worst cases" like the tiny-input case and 32-bit, single-core ARM devices even at a minor penalty to the "happy case". Oh, that reminds me! You should seriously think about a height limit. What's the current max RAM requirements with 64 height? Also I agree with what @sevenps said about using the compression function instead of the full hash function. [@oconnor663's note: This refers to another conversation we had over email.]

@oconnor663: Your and @sevenps's feedback has been incredibly helpful so far, thank you! In particular, your suggestion to use BLAKE2s was brilliant, and I've been working on that since our last conversation.One thing that's clear in retrospect is that since we're truncating the result to 32 bytes either way, doing 10 rounds of BLAKE2s is faster than doing 12 rounds of BLAKE2b, even with the two round functions taking roughly the same amount of time in AVX2-across-multiple-inputs. Switching to BLAKE2s should be a big win for the tiny-input-on-32-bit-ARM case. Bao's only overhead above BLAKE2s in that case is the 8-byte length suffix. Maybe that's small enough to be negligible? I suppose it depends on whether it pushes you over a block boundary.

The other big question I see related to mini-max, is the choice of block size. A smaller size would let us take advantage of parallelism for smaller inputs (still many kilobytes, "medium size"?), at the cost of more overhead in the large-file case. Very much an open question.

@zooko: Maybe this one's a case for "sweet spot" instead of "minimax". Overlay graphs of the CPU costs (cpb) for a smallish input (1 KB?) and the storage overhead for a large file, with X-axis = blocksize, and then just pick something a touch to the right of the "knee" of the first line?

Samuel Neves' idea about repurposing some of the inputs of the compression function would certainly increase throughput. (I haven't measured yet, but it sounds like "+50-75% for free".) But would it still be BLAKE2? If we're aiming for a BLAKE2T, how conservative should we be?

@zooko: See if you can get @sevenps and @veorq to design that part (apparently already done in a tweet!) and sign-off on it, and then that's conservative enough to merit the name imo. 🙂

I agree with your last thought about "trailing garbage" (which comes up during decoding but not during hashing or encoding). Bao doesn't even read those bytes, so trying to constrain them would be impractical. And I'd be worried about inconsistencies across implementations.

Agreed about making Bao "one size fits all". It might be worth noting in the spec that you could create a natural BLAKE2b-based variant, similar to what the @KeccakTeam did with KangarooTwelve/MarsupilamiFourteen. But that variant would have a different name.

Re a height limit, take a look at the arithmetic in https://github.com/oconnor663/bao/blob/master/docs/spec.md#storage-requirements …. The 8-byte length counter limits inputs to 2^64 bytes, and the minimum storage overhead to hash that input (in theory) is 1664 bytes, on top of the BLAKE2 state itself. Imposing a tighter height limit doesn't change the output for inputs below your limit, so it's possible that we could leave that up to implementations.

@zooko: Yeah I think you should just leave it at that. Good! Simpler.

Preventing memoization is the one question I'm on the fence about. One example in my head: There's no way to stop a programmer from memoizing any hash function, based on inputs that share the same starting bytes. But does anyone try that (disastrous idea) in practice? Re unknown future applications, there could be a lot of cool things a filesystem could do if hashing large ranges of zeros was free. From the filesystem's perspective that wouldn't be "memoization" per se but rather "interpreting structured input".

@zooko: Agree, but that raises security issues. Bao isn't the right place to innovate on that. Let Bao be fast and secure and widely useful and then let's start a new project for this interesting sparsity/dedup stuff and its weird security issues. I have ideas for you from LAFS!

Thanks so much for posting this thread!

For context, I'm working on an archive format for preservation and transmission of data across mediums, and I am super excited that Bao exists. I need a good tree hash, and I am definitely not up to the task of rolling my own!

You and Zooko definitely know more about this than I do, but I wanted to give some of my own thoughts.

trailing garbage

I like the idea of forbidding trailing garbage, or if that isn't possible, requiring that it be zeroed.

My thinking is that if someone is using bao encoded files in an application, they might use bao to verify that the data matches a hash, and then after that use a separate tool to consume the bao encoded file. An attacker could append malicious data to the end of the file, which would be ignored by the verifier but ingested by the tool.

This attack isn't possible if they use the bao command to stream validated data and ingesting that, but people might not do that.

Also, I definitely prefer that specs constrain behavior as much as possible, so I don't have to think about attacks that might arise from deviations from that behavior.

official blessing

I really like the idea of getting official blessing for Bao and calling it Blake2t. I think that tree hashes aren't more widely used due to lack of a good standard, and bao would get a ton of use and adoption just from being a named, standardized, well regarded hash function. The association with Blake2, which has a good reputation, would help a lot.

It would also help if there were no parameters for the main hash function, so that in the same way that when someone says SHA-256 and you know exactly what hash function they mean, when someone says Bao or Blake2t, you know exactly which hash function they mean. (Aside from maybe 256 bit variant and a 512 bit variant.)

special handling of zeroed data + side channels

In the interest of preventing implementations from leaking data through side channels, I think that I would avoid special handling of runs of zeros. I think it's better to have primitives with the most solid assurances possible (like bao w/little room for side channels), and create new primitives as needed, then to have primitives which require careful thought to understand their properties (bao w/possible side channels), and to have to take care understanding when they aren't suitable.

Finding out that something is too slow and much better than finding out that something isn't secure.

/comment

Thank you so much for Bao! I'm looking forward to watching it develop.

Thanks for the comments. I just pushed a new version of https://github.com/oconnor663/blake2b_simd, so the next step will be to bring blake2s_simd up to speed, and then there will be some big changes to Bao. Reactions to your comments:

My thinking is that if someone is using bao encoded files in an application, they might use bao to verify that the data matches a hash, and then after that use a separate tool to consume the bao encoded file. An attacker could append malicious data to the end of the file, which would be ignored by the verifier but ingested by the tool.

I've done a poor job of illustrating the "trailing garbage problem", and that might've led to some confusion here. Trailing garbage doesn't affect the decoded output, because the decoder stops at the end of the "real" encoding and never reads whatever garbage might've been appended to it. (If it did affect decoded output, that would break the whole security model. Nothing should be able to affect decoded output, except by truncating it with an error.) The decoder knows where to stop because the length prefix at the start of an encoding tells the decoder everything it needs to know about the structure of the tree and therefore the length of the encoded file. Under normal circumstances the decoder will read right up to EOF, but then stop before it actually encounters the EOF. That's why the "trailing garbage" question exists. If there are actually more bytes on the end of the encoding, where EOF should have been, the decoder never sees them.

So with that in mind, I need some help understanding the scenario you brought up. It sounds like:

1) I have a Bao encoded file, and I've confirm that it successfully decodes with the hash H. 2) Now, an attacker modifies that file. 3) Later, I come back to use the file. But instead of using bao decode $H $file, I'm using something else.

What is that "something else"? In my mind, the only valid thing to do with an encoded file is to decode it. (Or to recompute its hash the quick way with bao hash --encoded.) If some program tries to extract data from a Bao encoding without actually decoding it, well yes, that does defeat the whole purpose of the encoding. And at that point you have bigger problems than trailing garbage -- what if the attacker edits data at the front of the file?

I think that I would avoid special handling of runs of zeros

For sure, this is another thing I think I mentioned too hastily in the spec. If anyone writes a Bao implementation that inspects input bytes in any way, that's a terrible security vulnerability, because of the timing attacks you mentioned. The scenario I was thinking of would be where you already have structured data that says "this file is a gigabyte of zeros, followed by the bytes abc" and you want the hash of that virtual file. (For example, maybe you have this structured data because the user created a sparse file, and you are the filesystem.) The Bao design might or might not allow you to efficiently compute the hash of all-zero subtrees in that case. The timing of that operation could certainly reveal how much of the file is zeros, but I think that's true of any operation on sparse files. Whenever your application is compressing or sparse-ifying data, you've basically consented to reveal the structure of that data through timing.

The worry though, is that implementers might abuse this ability in a way that leads to real timing attacks on real files, through some misguided hope of finding an optimization. That would be really bad. I'm still kind of on the fence about it, in part because I don't think real optimizations are possible here in the typical case. Like yes, if you know in advance that your files contain large runs of all zeros, which aren't compressed, then you could optimize by inspecting your files in advance and finding where the runs stop and start. (Thereby creating the timing attack vulnerability we're worried about.) But if you benchmark that strategy on any real world inputs, which don't reliably contain giant runs of all zeros, you'll find that that additional pass you're making over your data slows you down tremendously, in exchange for almost no gain. So the bad implementations we're worried about are both slower and more difficult to build than a safe implementation. However, if we can think of a scenario where the bad implementation is either meaningfully faster, or simpler to code, then I would be more worried.

(Compare for example an "optimization" of SHA2. You could put together some scheme where you put the prefix of every file you've ever hashed in a giant global trie along with its hash state, and when an input shares a prefix with a previous input, you avoid the work of hashing the prefix. If you're hashing a ton of common prefixes, in theory this could save you a lot of work, at the cost of creating a big timing attack. But in practice I don't think anyone would ever do this.)

Edit 2019-06-06: My current thinking along these lines is: We could try to use node_offset to prevent implementers from harmfully caching subtrees, out of an abundance of caution. But we can't actually stop them from trying to cache. All we can do is make it so that it doesn't give them any performance benefit, by making it impossible to get a cache hit within a file (though not across multiple files). Yet by the same token, subtree caching already gives no performance benefit on real inputs, and in fact causes quite a lot of performance degradation, even when cache hits are theoretically possible.

Edit 2019-08-07: I've learned two things from testing. 1) The evil "check for empty chunks" strategy doesn't have a performance hit in practice, because a non-constant-time chunk will almost always short circuit after just a couple bytes. 2) The node_offset "disincentive" strategy also doesn't have a performance hit in practice. So I'm leaning more strongly towards including the mitigation now.

I made a repository with an instance of what that "something else" might be. It's definitely a bit contrived!

The repo contains a shell script named good.sh that Alice wants to encode and store, and retrieve from storage, check that it still hashes to the same value, and run it.

The attack is demonstrated in this file. It's a justfile, which is basically like make but it respects shebangs.

There are a few steps:

In setup Alice encodes good.sh and produces good.sh.bao.
In attack Mallory appends bad.sh, a malicious script to the end of good.sh.bao.
In check Alice checks that the hash of the encoded file is still the same, which it is, even though it now has the malicious script appended to the file.
In run Alice uses a naive python script to read the length of the encoded file's payload, extracts the payload by taking that many bytes from the end of good.sh.bao, and executes the extracted payload.

Alice's script is definitely a bit silly, but it can be tricked into running bad.sh even if she does check before run.

@casey let's split discussion of that scenario into this separate issue, https://github.com/oconnor663/bao/issues/24.

Closing this issue now that BLAKE3 is public :tada: :tada: :tada:

oconnor663 / bao