Stop hashing the length suffix in the root node?

Now that we're hewing to the security proof framework in "Sufficient conditions for sound tree and sequential hashing modes", we use domain separation between leaves and parents, and separately between the root and non-root nodes, to prevent collisions and length extensions. The length suffix on the root node might no longer be necessary. (That is, the hashed suffix that we treat as a finalization parameter. The length prefix at the front of the encoding will always be necessary.) Apart from being an extra moving part in the design, it also represents some overhead at very short input lengths. (For example, an input exactly one BLAKE2 block long becomes two blocks with the suffix added.) The alternative would be to leave the length out of hashing, so that it's still prepended in the encoding format, but no longer directly validated by checking the root node.

Objection 1: There is some benefit to hashing the length. It's possible (after the implementation validates the root node) to report the "expected" length of an input in a way that isn't malleable by a MITM.

However, this guarantee is quite subtle, and probably hard to understand or make use of in practice. Here's vaguely what we could guarantee:

You can take the reported len() as the guaranteed truth about the length that the person who prepared the encoding wanted to report to you. But the stream you're reading might hit truncation or corruption before then, so don't expect to necessarily get all those bytes. And it might be an attacker-prepared encoding that deliberately mis-reports its length, in a way that isn't discoverable until the very last leaf fails validation, or perhaps (more on this below) isn't discoverable at all. And by the way, please never parse the first 8 bytes of the stream as a u64 yourself, as tempting as that might be, because then you have no guarantees whatsoever.

And all those caveats are still assuming that the implementation is careful to check the root node before reporting the length. It's a very tempting implementation pitfall for streaming decoders that want to expose the length to the caller, to do so as soon as they've parsed the first 8 bytes, rather than witholding it until the root node passes muster. So even if we expected callers to 1) understand what's being guaranteed about the validated length and to 2) find those guarantees useful -- both of which are questionable -- the overriding problem might be that Bao implementations in the wild aren't going to uphold these guarantees.

One particular version of this attack that's implementations are especially likely to overlook: An attacker prepares an encoding that records a length of 10 in the length prefix, but in fact the sole chunk of input only contains 9 bytes. The attacker computes the root hash with those 9 bytes, suffixed by this fake reported length of 10. (A real encoder would never do this, because it counts the actual bytes you feed it rather than asking you separately for the length, but an attacker is free to do this.) Now, what happens when that encoding arrives at a decoder? After reading the length prefix, the decoder is going to try to read 10 further bytes from the stream. In Rust, this is likely to fail, because APIs like Read::read_exact return errors for short inputs. But in Python, a short read will succeed and simply return fewer bytes than requested. If the encoder hashes the bytes that it was given, without double-checking that they match the length it asked for, the whole decoding will probably succeed too. (This is why the verify_node function in tests/bao.py includes a node_size argument. It has to explicitly check how many bytes it got. It's very easy to forget this in a Python implementation.)

Is the scenario above a security issue? I think it is. Not because it allows a MITM to mutate the input after the fact (it doesn't), but because it creates a sort of backwards collision. Because this weird attacker-produced hash is different from the one that a regular encoder would generate, there's now more than one hash that can successfully decode the same input. Or from another angle, there's now an encoding that successfully decodes with something other than the straightforward hash of its output. It's not a collision in the usual sense of the term (it's not two inputs giving the same hash), but it's still something that callers might assume is impossible. Imagine a hash table of encodings indexed by their hashes. If you hash some input, and find that its hash isn't present as a key in the table, can you assume that no encoding in the map will decode to that input? I think you should be allowed to assume that, and any Bao design that can't reliably guarantee that property in the wild is problematic.

So the most compelling reason not to include the length suffix when hashing the root node, is that it's ultimately safer for us to make fewer guarantees about the length, and probably to get rid of the len() API on the decoder entirely.

Objection 2: Removing that implementation pitfall is above is all well and good. But doesn't this increase the danger than implementation might screw up the length zero case?

No, but let's explain that case. The idea is that I take an encoding of some non-zero length, and I tweak the length prefix to be zero. What happens inside the decoder? The danger is that an incorrect decoder might parse only the length and then say something like "The number of bytes I've emitted so far is zero, and that's equal to the number I'm expecting to emit, so I should just return a successful EOF." But since the real encoding wasn't supposed to have EOF there, that's full-blown malleability by the attacker. A correct decoder always has to verify at least the root node, even if that node is the empty chunk. Hashing the length as a root suffix, or not, doesn't change that. This is a serious implementation pitfall, possibly the most serious one I know of, but it's there either way.

Objection 3: What about seeking? In particular, what if I seek past EOF? Previously we relied on hashing the root node to guarantee that the length was valid, and that it was safe to return EOF if I seek past the end.

This is subtle and tricky, and I'm going to need someone to check my work, but I believe that with a change to the required behavior of the decoder we're still safe here. Previously the rule was "before seeking past the end (and potentially then giving the caller a successful EOF) you must validate the root node." Now the rule would be "before [same] you must validate the final chunk."

The main attack to consider is one where the MITM shortens the reported length of an encoding. This is the case that turns what should've been non-empty reads into potential spurious EOFs. This could do one of two things to the tree. 1) With a small tweak, it could shorten the final chunk. That would cause a hash mismatch for that chunk, assuming BLAKE2 is secure against collisions. 2) With a larger tweak, it could shrink the structure of the tree, reinterpreting one of the parent nodes along the right edge as the final chunk. That would also cause a hash mismatch, because chunks and parents are domain separated.

The other attack, where the MITM increases the reported length of an encoding, is less interesting. As noted in Objection 1 above, what happens here depends on whether the underlying IO functions are permissive about short reads. It's possible that decoding could succeed, with the quirks of permissive IO "reinterpreting" the tweaked length as the original. Or in a less permissive language it could fail with an IO error. Or (if trailing garbage is available to supply longer reads), it could cause a hash mismatch similar to above.

If a given encoding could decode to different output on two different implementations, that would be a security disaster. If it decodes successfully on one implementation, but fails on another, is that a disaster? It's certainly not ideal. But ultimately I don't think it's within our power to prevent every inconsistency like this. The IO functions in different languages simply have different behavior, and the natural tendency of implementers to rely on that behavior will lead to inconsistencies in the wild. We could beg implementers to wrap their IO logic in some consistent way, but at the same time there's a danger of encouraging callers to rely on something that we can't realistically expect implementers to test thoroughly. The best we can hope for is a design where this failure mode doesn't violate the essential guarantees of the protocol.

Side question: Could a strict security requirement like "any decoding that succeeds in one implementation must succeed in all implementations" be realistic? I'm not sure. Even if we successfully forbade inaccurate lengths and trailing garbage, to avoid any difference in behavior around those things, it's still possible for a large encoding to exhaust memory in one (non-streaming) implementation but not in another.

Objection 4: An encoded stream would be malleable now, in a way it didn't used to be.

Yes, as noted above, depending on how permissive your language is with IO, it might be possible to increase the reported length without breaking decoding, for some decoders. But remember that it's always been possible to add trailing garbage without breaking decoding, as most decoders will never read past the expected end of an encoding at all. Notably, neither of these cases allow an attacker to change the output of a successful decoding.

In addition to issues with "~~reverse~~ decoding collisions", there are also some tricky considerations around what it means to use SeekFrom::End. Here's some draft language that I'm planning on sticking in spec.md.

After parsing the length from the first eight bytes of an encoding, the decoder traverses the tree by reading parent nodes and chunk nodes. The decoder verifies the hash of each node as it's read, and it may return the contents of each valid chunk to the caller immediately. The length itself is considered validated when and only when the decoder validates the final chunk, either by validating the entire encoding or by seeking to the end and validating only the right edge of the tree. The decoder must not expose the length to the caller in any way before the final chunk is validated. There are a number of ways the decoder might expose the length, some of which are less obvious than others:

An explicit .length() method. The reference implementation doesn't include one, because it would be required to seek internally. Callers who need the length in advance will usually do better to store it separately along with the hash.
Reading the empty encoding. Any read of the empty encoding reports EOF, thereby exposing the length (zero). The decoder must verify that the final chunk (that is, the empty chunk) matches the root hash. Most implementations will naturally satisfy this requirement for non-empty encodings as part of reading chunk bytes, but it's easy to forget it in the empty case if you write code like "current position equals content length therefore EOF."
Seeking past the end. If I seek to an offset greater than or equal to the content length, the next read will return EOF, exposing the length. That means either the seek itself or the following read must have validated the final chunk.
Seeking relative to the end. Most seek implementations support something akin to SeekFrom::End. That exposes the length through the absolute offset returned by the seek, which means the seek itself must validate the final chunk.

For decoders that don't expose a .length() method and don't support seeking, the security requirements can be simplified to "verify the hash of every node you read, and don't skip the empty chunk." But decoders that do support seeking need to consider the final chunk requirement very carefully.

Would hashing the length as associated data improve the security of the decoder?

No, not for a correct decoder. An attacker modifying the length bytes can't produce any observable result, other than the errors that are also possible by modifying or truncating the rest of the encoding. The story is more complicated if we consider "sloppy" implementations that accept some invalid encodings, in which case hashing the length could mitigate some attacks but would also create some new ones. An earlier version of the Bao design did hash the length bytes, but the current design doesn't.

Let's start by considering a correct decoder. What happens if a man-in-the-middle attacker tweaks the length header in between encoding and decoding? Small tweaks change the expected length of the final chunk. For example, if you subtract one from the length, the final chunk might go from 4096 bytes to 4095 bytes. Assuming the collision resistance of BLAKE2, the 4095 byte chunk will necessarily have a different hash, and validating it will lead to a hash mismatch error. Adding one to the length would be similar. Either no additional bytes are available at the end to supply the read, leading or an IO error, or an extra byte is available somehow, leading to a hash mismatch. Larger tweaks have a bigger effect on the expected structure of the tree. Growing the tree leads to chunk hashes being reinterpreted as parent hashes, and shrinking the tree leads to parent hashes being reinterpreted as chunk hashes. Since chunks and parents are domain separated from each other, this also leads to hash mismatch errors in the tree, in particular always including some node along the right edge.

Those observations are the reason behind the "final chunk requirement" for decoders. That is, a decoder must always validate the final chunk before exposing the length to the caller in any way. Because an attacker tweaking the length will always invalidate the final chunk, this guarantees that the modified length value will never be observed outside of the decoder. Length tweaks might or might not invalidate earlier chunks before the final one, and decoding some chunks might succeed before the caller eventually hits an error, but that's indistinguishable from simple corruption or truncation at the same point in the tree.

So, what happens if the decoder gets sloppy? Of course the implementation could neglect to check any hashes at all, providing no security. Assuming the implementation does check hashes, there are couple other subtle mistakes that can still come up in practice (insofar as I made them myself in early versions of the reference implementation).

The first one, we just mentioned: failure to uphold the "final chunk requirement". As a special case, recall that the empty tree consists of a single empty chunk; if the decoder neglects to validate that empty chunk and skips right to its EOF state, then the empty encoding wouldn't actually validate anything at all, making it appear valid under any root hash. More generally, if the decoder seeks past EOF or relative to EOF without validating the final chunk first, an attacker could effectively truncate encodings without detection by shortening the length, or change the target offset of EOF-relative seeks.

The other likely mistake is "short reads". The IO interfaces in most languages are based on a read function which usually returns as many bytes as you ask it to but which may return fewer for any reason. Implementations need to either call such functions in a loop until they get the bytes they need, or use some higher level wrapper that does the same. Implementations that neglect to call read in a loop will often appear to work in tests, but will be prone to spurious failures in less common IO settings like reading from a pipe or a socket. This mistake also opens up a class of attacks. An attacker might modify the length header of an encoding, for example creating an encoding with 9 content bytes but a length header of 10. In this case, a correct decoder would fail with an unexpected-end-of-file error, but an incorrect decoder might "short read" just the 9 bytes without noticing the discrepancy and then successfully validate them. That exposes the caller to inconsistencies that the attacker can control: The length of all the bytes read (9) doesn't match the offset returned by seeking to EOF (10), and like the "final chunk requirement" issue above, an attacker can again change the target offset of EOF-relative seeks.

With those two classes of attacks in mind, we can come back to the original question: Would hashing the length as associated data (probably as a suffix to the root node) mitigate any of the attacks above for sloppy implementations? The answer is some yes and some no.

The most egregious "final chunk requirement" vulnerability above -- validating nothing at all in the case of an empty encoding -- remains a pitfall regardless of associated data. Seek-past-EOF also remains a pitfall but in a slightly modified form: the implementation might detect the modified length if it validates the root node before seeking, but forgetting to validate the root node would be the same class of mistake as forgetting to validate the final chunk. However, the decoder would do better in any scenario where you actually tried to read chunk data; getting to a chunk means validating the root node on your way down the tree, and bad associated data would fail validation at that point.

The "short reads" vulnerabilities above would also be partially mitigated. In a scenario where the attacker corrupts a "legitimate" encoding by changing its length header after the fact, hashing the length as associated data would detect the corruption and prevent the attack. But in a scenario where the attacker constructs both the encoding and the hash used to decode it, the attacker may craft an "illegitimate" root hash that expects an inconsistent length header. (A typical encoder doesn't expose any way for the caller to put an arbitrary value in the length header, but there's nothing stopping an attacker from doing it.) In this case the inconsistent length pitfall would remain: the root node would validate with the bad length as associated data, the final chunk would validate with a short read, and once again the length of all the bytes read wouldn't match the offset returned by seeking to EOF.

If that were the whole story -- that hashing the length as associated data mitigated some attacks on sloppy implementations -- that would be some motivation to do it. However, that last scenario above actually leads to a new class of attacks, by violating Bao's "no decoding collisions" guarantee. That is, no input should ever decode (successfully, to completion) under more than one hash. (And naturally the one hash an input does decode under must be the hash of itself.) The illegitimate encoding above and its exotic root hash constitute a "decoding collision" with the legitimate encoding they're derived from. To put yourself in the shoes of a caller who might care about this property, imagine you have a dictionary containing Bao encodings indexed by the hashes that decode them. If you find that a given string's hash isn't present as a key in your dictionary, is it safe for you to assume that no encoding in your dictionary will decode to that string? Bao says yes, you may assume that. And maybe equally importantly, callers in that scenario will assume that without bothering to read the spec. In this sense, including the length as associated data would actually make sloppy implementations less secure, by giving attackers a way to create decoding collisions.

Earlier versions of Bao did append the length to the root node, instead of using a chunk/parent distinguisher. A proper distinguisher (the node_depth initialization parameter) was added later, both to better fit the Sufficient conditions framework and to avoid issues around integer overflow. At that point the length suffix was redundant, and it also incurred some performance overhead in the short message case, where a one-block message would require two blocks of hashing. It was dropped mainly for that performance reason, since the sloppy implementation concerns above aren't decisive in either direction.

oconnor663 / bao

Stop hashing the length suffix in the root node? #22

Would hashing the length as associated data improve the security of the decoder?