Closed oconnor663 closed 5 years ago
A couple of clarifications I think we should make first:
bao hash --encoded
command in the check
step doesn't do what you might think it does. It doesn't validate an entire encoded file actually matches the hash in question. It only reads the root node out of the encoding and re-hashes that one node, to re-produce the root hash. (In this short input example, the root node does happen to be the whole thing though.) We could replace it with bao decode
to get the effect you intended.With that in mind, a couple of specific responses to the scenario:
More generally, I think Alice is committing a cardinal sin by not checking hashes during decoding. If you do check hashes, it really doesn't matter where you get the bytes you check. You can pull them from trailing garbage, you can pull them from /dev/urandom
, whatever. If they match the hash they're supposed to match, they must be the correct bytes. (Or else you've found a collision in BLAKE2.)
That makes sense, thanks for clarifying. I think that I assumed that the bao
command would work more like the shasum
command, i.e. that it could be used to check the hash of a file before using it with some other tool. It might be the case that this is a common misconception, or it might not be.
I do think that many users will assume that bao hash --encoded FILE
will check the whole file, instead of just the root hash, as I did, which might make them vulnerable, even if it is based on an incorrect assumption that reading the documentation first would prevent.
What scenarios did you have in mind where allowing arbitrary bytes at the end of an encoded file would be desirable?
I think that I assumed that the
bao
command would work more like theshasum
command, i.e. that it could be used to check the hash of a file before using it with some other tool.
Certainly you can use bao hash
just like you would use sha1sum
. If you run a command like bao hash *
, you'll see that the output format is pretty identical to sha1sum *
. It's bao encode
and bao decode
that are venturing into new territory, mostly to support streaming applications. If you have a whole file, and you're going to read the whole file, there's not much of a reason to use anything besides bao hash
.
I do think that many users will assume that
bao hash --encoded FILE
will check the whole file, instead of just the root hash, as I did, which might make them vulnerable, even if it is based on an incorrect assumption that reading the documentation first would prevent.
That's a good point, and your scenario made me think about it a little bit more. I think you're right that bao hash --encoded
is likely to be confusing. But I think what saves us here is that at the end of the day, the only way you're going to consume an encoded file is bao decode
, and bao decode
checks everything. So even if users have mistaken ideas about what bao hash --encoded
is doing, I'm not sure there's any way to attack them. (Edit: I went ahead and opened https://github.com/oconnor663/bao/issues/25, since there's no downside to picking a better name.)
(Of course if the user implements some substitute decode
command that doesn't actually check anything, then they can hurt themselves. But the same is true of any cryptographic protocol right?)
What scenarios did you have in mind where allowing arbitrary bytes at the end of an encoded file would be desirable?
It's not that it's desirable, but more that it's tricky to come up with a way to prevent it.
Bao puts the length of the encoded content at the front of the encoding. That one number tells the decoder everything about the structure of the tree. So under normal circumstances, the decoder never encounters EOF, because it already knows exactly how many bytes it's going to read. If there happen to be garbage bytes sitting there, instead of EOF, the decoder will never see them, because it doesn't ask.
Now we could say that after a successful decoding, the decoder has to issue "one more read" just to confirm that there's nothing else. But that raises some issues:
That second problem is the one that bothers me more. If we try to guarantee that there's no trailing garbage, it's because we expect that some users somewhere will want to rely on that guarantee, right? I'd be worried about those users. Eventually they'll port their software to Turbo Pascal, and the programmer in charge of the Turbo Pascal Bao implementation will have forgotten to add the trailing garbage check, and they'll be vulnerable. Sometimes no promise at all is better than a promise that we might break.
In my head, I invoke the same rule here that I did above. Ultimately, the only way to consume an encoded file is to decode
it, and if you do that, you don't care about whether or not the tail is malleable. (Or whether or not the length prefix is malleable, see here.)
That's a good point, and your scenario made me think about it a little bit more. I think you're right that bao hash --encoded is likely to be confusing. But I think what saves us here is that at the end of the day, the only way you're going to consume an encoded file is bao decode, and bao decode checks everything. So even if users have mistaken ideas about what bao hash --encoded is doing, I'm not sure there's any way to attack them.
To reduce confusion, one option would be to use a different command for bao hash --encoded FILE
, to make it clear that unlike bao hash
, it isn't verifying the full contents of the file. Maybe something like bao get-root-hash ENCODED_FILE
.
You bring up good points. I think I might expect the command line tool to differ from the library, as in, I might expect the tool to verify an entire file, but I would certainly not expect bao::decode(stream: &mut io::Read)
to perform additional reads past the end of the encoded file.
However, making the library do different things from the command might not be worth the inconsistency.
Slicing content bytes off the end of the encoding also only appears to work for inputs shorter than two chunks (8192 bytes). Any longer than that and you get parent nodes interspersed inside your content.
Just going back to this for a second, is this true? In the spec it looks like there's a tree with three chunks, and the combined encoding shows the three chunks appearing in-order at the end of the encoding.
In the spec it looks like there's a tree with three chunks, and the combined encoding shows the three chunks appearing in-order at the end of the encoding.
Oh you're totally right. The first parent node in the middle shows up when you grow past 3 chunks. At that point, the parent will be to the left of the last 2 chunks, which I think is how I confused myself.
Now that bao hash --encoded
is entirely gone, I'm going to close this. Feel free to reopen though.
@casey suggested this scenario in #21:
I'm splitting my thoughts into this separate issue.