Alice eats out of the garbage

oconnor663 commented 5 years ago

@casey suggested this scenario in #21:

In setup Alice encodes good.sh and produces good.sh.bao.

In attack Mallory appends bad.sh, a malicious script to the end of good.sh.bao.

In check Alice checks that the hash of the encoded file is still the same, which it is, even though it now has the malicious script appended to the file.

In run Alice uses a naive python script to read the length of the encoded file's payload, extracts the payload by taking that many bytes from the end of good.sh.bao, and executes the extracted payload.

I'm splitting my thoughts into this separate issue.

oconnor663 commented 5 years ago

A couple of clarifications I think we should make first:

The bao hash --encoded command in the check step doesn't do what you might think it does. It doesn't validate an entire encoded file actually matches the hash in question. It only reads the root node out of the encoding and re-hashes that one node, to re-produce the root hash. (In this short input example, the root node does happen to be the whole thing though.) We could replace it with bao decode to get the effect you intended.
Slicing content bytes off the end of the encoding also only appears to work for inputs shorter than two chunks (8192 bytes). Any longer than that and you get parent nodes interspersed inside your content.

With that in mind, a couple of specific responses to the scenario:

Even in a world where trailing garbage is forbidden, Alice has big problems here. The second issue above means she's decoding the wrong bytes above size 8192, and Mallory can exert some control over what those wrong bytes are.
Alice also has a nasty race condition, if Mallory can manage to modify the encoding in between when Alice verifies it and when she decodes it.

More generally, I think Alice is committing a cardinal sin by not checking hashes during decoding. If you do check hashes, it really doesn't matter where you get the bytes you check. You can pull them from trailing garbage, you can pull them from /dev/urandom, whatever. If they match the hash they're supposed to match, they must be the correct bytes. (Or else you've found a collision in BLAKE2.)

casey commented 5 years ago

That makes sense, thanks for clarifying. I think that I assumed that the bao command would work more like the shasum command, i.e. that it could be used to check the hash of a file before using it with some other tool. It might be the case that this is a common misconception, or it might not be.

I do think that many users will assume that bao hash --encoded FILE will check the whole file, instead of just the root hash, as I did, which might make them vulnerable, even if it is based on an incorrect assumption that reading the documentation first would prevent.

What scenarios did you have in mind where allowing arbitrary bytes at the end of an encoded file would be desirable?

oconnor663 commented 5 years ago

I think that I assumed that the bao command would work more like the shasum command, i.e. that it could be used to check the hash of a file before using it with some other tool.

Certainly you can use bao hash just like you would use sha1sum. If you run a command like bao hash *, you'll see that the output format is pretty identical to sha1sum *. It's bao encode and bao decode that are venturing into new territory, mostly to support streaming applications. If you have a whole file, and you're going to read the whole file, there's not much of a reason to use anything besides bao hash.

I do think that many users will assume that bao hash --encoded FILE will check the whole file, instead of just the root hash, as I did, which might make them vulnerable, even if it is based on an incorrect assumption that reading the documentation first would prevent.

That's a good point, and your scenario made me think about it a little bit more. I think you're right that bao hash --encoded is likely to be confusing. But I think what saves us here is that at the end of the day, the only way you're going to consume an encoded file is bao decode, and bao decode checks everything. So even if users have mistaken ideas about what bao hash --encoded is doing, I'm not sure there's any way to attack them. (Edit: I went ahead and opened https://github.com/oconnor663/bao/issues/25, since there's no downside to picking a better name.)

(Of course if the user implements some substitute decode command that doesn't actually check anything, then they can hurt themselves. But the same is true of any cryptographic protocol right?)

What scenarios did you have in mind where allowing arbitrary bytes at the end of an encoded file would be desirable?

It's not that it's desirable, but more that it's tricky to come up with a way to prevent it.

Bao puts the length of the encoded content at the front of the encoding. That one number tells the decoder everything about the structure of the tree. So under normal circumstances, the decoder never encounters EOF, because it already knows exactly how many bytes it's going to read. If there happen to be garbage bytes sitting there, instead of EOF, the decoder will never see them, because it doesn't ask.

Now we could say that after a successful decoding, the decoder has to issue "one more read" just to confirm that there's nothing else. But that raises some issues:

Are applications allowed to concatenate Bao encodings together into a stream? The length at the front of the encoding format makes it self-delimiting, so in theory this works just fine. But if the decoder reads past the end of an encoding to look for EOF, then it'll chop bytes off the front of the next one in the stream, and this self-delimiting strategy doesn't work. It's not that I think a lot of realistic applications are going to do this, but it's not clear to me that we gain much from breaking it either.
More realistically, how likely is it that Bao implementations in the wild will get EOF checking right? This behavior will never matter up in normal operation. If the implementer forgets this extra step at the end, probably all of their tests will still pass.

That second problem is the one that bothers me more. If we try to guarantee that there's no trailing garbage, it's because we expect that some users somewhere will want to rely on that guarantee, right? I'd be worried about those users. Eventually they'll port their software to Turbo Pascal, and the programmer in charge of the Turbo Pascal Bao implementation will have forgotten to add the trailing garbage check, and they'll be vulnerable. Sometimes no promise at all is better than a promise that we might break.

In my head, I invoke the same rule here that I did above. Ultimately, the only way to consume an encoded file is to decode it, and if you do that, you don't care about whether or not the tail is malleable. (Or whether or not the length prefix is malleable, see here.)

casey commented 5 years ago

That's a good point, and your scenario made me think about it a little bit more. I think you're right that bao hash --encoded is likely to be confusing. But I think what saves us here is that at the end of the day, the only way you're going to consume an encoded file is bao decode, and bao decode checks everything. So even if users have mistaken ideas about what bao hash --encoded is doing, I'm not sure there's any way to attack them.

To reduce confusion, one option would be to use a different command for bao hash --encoded FILE, to make it clear that unlike bao hash, it isn't verifying the full contents of the file. Maybe something like bao get-root-hash ENCODED_FILE.

You bring up good points. I think I might expect the command line tool to differ from the library, as in, I might expect the tool to verify an entire file, but I would certainly not expect bao::decode(stream: &mut io::Read) to perform additional reads past the end of the encoded file.

However, making the library do different things from the command might not be worth the inconsistency.

Slicing content bytes off the end of the encoding also only appears to work for inputs shorter than two chunks (8192 bytes). Any longer than that and you get parent nodes interspersed inside your content.

Just going back to this for a second, is this true? In the spec it looks like there's a tree with three chunks, and the combined encoding shows the three chunks appearing in-order at the end of the encoding.

oconnor663 commented 5 years ago

In the spec it looks like there's a tree with three chunks, and the combined encoding shows the three chunks appearing in-order at the end of the encoding.

Oh you're totally right. The first parent node in the middle shows up when you grow past 3 chunks. At that point, the parent will be to the left of the last 2 chunks, which I think is how I confused myself.

oconnor663 commented 5 years ago

Now that bao hash --encoded is entirely gone, I'm going to close this. Feel free to reopen though.

oconnor663 / bao

Alice eats out of the garbage #24