penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0

TCT makes use of field element internal representation in a potentially error-prone way #3743

Closed cronokirby closed 7 months ago

cronokirby commented 8 months ago

https://github.com/penumbra-zone/penumbra/blob/59cdd7552dd4e50aea948328a79bc034c599fa1e/crates/crypto/tct/src/internal/hash.rs#L145

This line relies on the fact that Fq(2^256 - 1), using the arkworks implementation of that field, returns an "element" which is not equal to any other. Mathematically, this isn't a well-defined operation if we consider Fq as a generic abstraction of the field, since, abstractly, that element is equal to Fq(2^256 - 1 mod q) as well.
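A toy illustration of the leak (not the real decaf377/arkworks types, and a small stand-in modulus): when equality compares the raw internal representation, an unchecked constructor can produce a value that is unequal to every canonically reduced element, even though mathematically it should reduce mod q.

```rust
// Toy field Z/QZ whose equality compares the raw internal representation,
// mimicking how arkworks' Fq behaves. Names here are illustrative only.
const Q: u64 = 251; // stand-in modulus; the real Fq modulus is ~2^255

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Fq(u64); // raw internal representation

impl Fq {
    /// Checked constructor: always reduces mod Q.
    fn new(n: u64) -> Self {
        Fq(n % Q)
    }
    /// Unchecked constructor: stores the raw value, even if out of range.
    fn new_unchecked(n: u64) -> Self {
        Fq(n)
    }
}

fn main() {
    let sentinel = Fq::new_unchecked(u64::MAX); // like Fq(2^256 - 1)
    // The sentinel is unequal to every canonically reduced element...
    assert!((0..Q).all(|n| Fq::new(n) != sentinel));
    // ...but as an abstract field element it "should" have satisfied:
    assert_eq!(Fq::new(u64::MAX), Fq::new(u64::MAX % Q));
    println!("sentinel is distinct from every reduced element");
}
```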

This might pose an issue when we start migrating the repo to use the latest version of decaf377, which contains a new backend, where a method to construct a field element from an unchecked internal representation may not be provided, or may not behave in the same way.

cronokirby commented 8 months ago

A very simple fix for this would be to instead pick a random element of Fq, and use that as a hardcoded constant here. (This constant could be defined in the decaf377 crate). The likelihood of that element ever getting produced by some other piece of code, including this, is indistinguishable from 0.

plaidfinch commented 8 months ago

It's a good point: the TCT does treat Fq a little bit loosely. However, I don't think it's a bug; it's a deliberate series of design choices which were intended to address specific requirements. First I want to respond to your points, then afterwards provide a bit more context for why I recommend what I do.

> A very simple fix for this would be to instead pick a random element of Fq, and use that as a hardcoded constant here. (This constant could be defined in the decaf377 crate). The likelihood of that element ever getting produced by some other piece of code, including this, is indistinguishable from 0.

I don't like this fix because the probability, while tiny, is not zero. The TCT, unlike most of the rest of the Penumbra system, is an inductively defined data structure with relatively straightforward semantics for its relatively small and self-contained interface. It ought to be tractable (in both the normative and the descriptive senses of "ought") to formally prove various correctness theorems about its behavior. Reducing this correctness guarantee to a mere probabilistic one would complicate such reasoning significantly, and I don't want to do a quick patch that rules out formally verifying the heart of the shielded pool.

> This might pose an issue when we start migrating the repo to use the latest version of decaf377, which contains a new backend, where a method to construct a field element from an unchecked internal representation may not be provided, or may not behave in the same way.

Have we checked that it does not provide such a method? If such a method is not provided, my recommendation is that we add it; regardless, we should make sure it behaves the same way, and document our reliance on this fact. I would not recommend making any further changes to the TCT to address this concern; in my opinion, none are needed. I believe the only guarantees we need from the new backend's unchecked conversion operation are that equality commutes with it, i.e. Fq(n) == Fq(m) if and only if n == m, and that we know a value that is out of bounds to use as a sentinel. I would be very surprised if it was impossible, or even difficult, to add this operation, if it does not already exist.
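The required guarantee can be stated as a small property check. This is a sketch over a toy limb representation (the name `from_raw_unchecked` and the limb layout are assumptions, not the actual decaf377 backend API): equality on the wrapped type must track equality of the raw input exactly, including for out-of-bounds sentinel values.

```rust
// Property sketch: from_raw_unchecked(n) == from_raw_unchecked(m) iff n == m,
// even when n or m is out of range for the field modulus.
#[derive(PartialEq)]
struct FqRaw([u64; 4]); // toy stand-in for a field element's internal limbs

fn from_raw_unchecked(limbs: [u64; 4]) -> FqRaw {
    FqRaw(limbs) // no reduction: the raw limbs *are* the representation
}

fn main() {
    let samples: &[[u64; 4]] = &[
        [0, 0, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1],
        [u64::MAX; 4], // an out-of-bounds sentinel value
    ];
    for a in samples {
        for b in samples {
            // Equality commutes with the unchecked conversion.
            assert_eq!(from_raw_unchecked(*a) == from_raw_unchecked(*b), a == b);
        }
    }
    println!("equality commutes with unchecked conversion on all samples");
}
```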

Now, some notes on why the TCT is the way it is, and why I think it's a good idea to continue using this method of implementation:

Out-of-range inhabitants of the Fq Rust type are used, on purpose, in several key places in the TCT codebase. Most notable are the following:

https://github.com/penumbra-zone/penumbra/blob/06568d772fa6f8043a48d1feac0e6be9d007a6f9/crates/crypto/tct/src/internal/hash/option.rs#L7-L17

This is in turn used in the CachedHash type, an essential performance optimization to permit the internal hashes of the frontier of the tree to be lazily evaluated:

https://github.com/penumbra-zone/penumbra/blob/06568d772fa6f8043a48d1feac0e6be9d007a6f9/crates/crypto/tct/src/internal/hash/cache.rs#L9-L13

These uses could in principle be replaced with Option<Hash>, at the cost of losing the niche-filling memory compaction that the OptionHash type provides. I do not think that we should forego this optimization and risk messing things up; as is, everything here is self-contained and easy to verify as correct, provided that the internal representation of Fq (and thereby Hash) stays consistent, which per semver it must, because it is a public field of Fq.
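A toy sketch of the niche-filling trick being discussed (assumed shapes, not the real TCT types): a Hash is four 64-bit limbs, and the Option-like wrapper reserves one out-of-range limb pattern as its "None" sentinel instead of paying for a separate discriminant, so it stays the same size as the hash itself.

```rust
use std::mem::size_of;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Hash([u64; 4]); // toy stand-in for the TCT's Hash newtype over Fq

#[derive(Clone, Copy)]
struct OptionHash(Hash); // sentinel-compacted Option<Hash>

// An out-of-range bit pattern for the field modulus, used as "None".
const SENTINEL: Hash = Hash([u64::MAX; 4]);

impl OptionHash {
    fn none() -> Self {
        OptionHash(SENTINEL)
    }
    fn some(h: Hash) -> Self {
        OptionHash(h)
    }
    fn get(self) -> Option<Hash> {
        if self.0 == SENTINEL { None } else { Some(self.0) }
    }
}

fn main() {
    // Hash has no free niche, so Option<Hash> needs extra discriminant space;
    // the sentinel version stays at exactly four limbs.
    assert_eq!(size_of::<OptionHash>(), 32);
    assert!(size_of::<Option<Hash>>() > size_of::<OptionHash>());
    assert_eq!(OptionHash::none().get(), None);
    assert_eq!(OptionHash::some(Hash([1, 2, 3, 4])).get(), Some(Hash([1, 2, 3, 4])));
    println!("sentinel Option saves {} bytes per hash",
             size_of::<Option<Hash>>() - size_of::<OptionHash>());
}
```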

However, another use of an out-of-bounds Fq-typed value cannot be replaced even in principle with Option<Fq>, except by completely redesigning an entire large piece of functionality (which I personally would not recommend, both to minimize engineering effort and to preserve existing, demonstrably correct code): the use of an uninitialized hash value during out-of-order deserialization. When we first start loading hashes and commitments, we need to create a tree that is advanced to the correct position, without knowing what its contents will be. This is only possible if we can fill in TBD values for hashes (commitments can all be set to forgotten, then "unforgotten" during loading).

https://github.com/penumbra-zone/penumbra/blob/06568d772fa6f8043a48d1feac0e6be9d007a6f9/crates/crypto/tct/src/storage/deserialize.rs#L53-L63

Notice how we call the method OutOfOrder::uninitialized, which gives us a deliberately invalid Tree: invalid in that it violates the requirement that it be merkle. As we progress through the initialization process, we fill in everything the serialization contains of the commitments and any stored internal hashes, then finally we step through the tree and fill in any still-uninitialized hashes with newly computed values based on their children. There are two things to note here:

1. We need to be able to distinguish between uninitialized hashes and other hashes, because filling in only the uninitialized hashes permits us to trade off storage of internal hashes against avoided recomputation.
2. Not every set of (position, height)-indexed internal and leaf hashes and leaf commitments can possibly be deserialized into a valid tree: if a position in the inferred structure must be a leaf, and no commitment nor leaf hash is stored for that position, we cannot guess what it needs to be (and therefore we cannot guess what any ancestor node hashes need to be, should they be omitted).

Zooming out, the reason deserialization is done this way (out of order, using uninitialized values) is to ensure that we do not need to morally copy the entire serialization of the tree into some intermediate structure in memory before turning it into the final tree structure. Instead, we can stream values in any order from storage directly into their final in-memory representation, which means we use half the memory during deserialization that we otherwise would. The TCT is deliberately designed for serialization to be incremental and for deserialization to be efficient in time, memory, and I/O operations (not mentioned above is how this design also lets you read the tree storage database in one long sequential pass, rather than a bunch of random reads). In summary, the use of these niche out-of-range values reduces memory usage both during use (the former case) and at load time (the latter case).
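The loading scheme described above can be sketched in miniature (a hypothetical heap-layout tree and a placeholder hash, nothing like the real TCT's structure): nodes start as an out-of-range UNINITIALIZED sentinel, stored hashes stream in, in arbitrary order, and a final bottom-up pass computes only the hashes that are still missing.

```rust
// Stand-in for the out-of-range Fq sentinel; a real hash never takes this value.
const UNINITIALIZED: u64 = u64::MAX;

// Placeholder parent hash; the real tree uses Poseidon over Fq.
fn toy_hash(left: u64, right: u64) -> u64 {
    left.wrapping_mul(31).wrapping_add(right)
}

fn main() {
    // Heap layout for a complete tree with 4 leaves: nodes[1] is the root,
    // children of node i are nodes 2i and 2i+1, leaves are nodes[4..8].
    let mut nodes = [UNINITIALIZED; 8];

    // Stream values in arbitrary order: all four leaves, plus one internal
    // hash that happened to be stored in the serialization.
    nodes[5] = 11;
    nodes[4] = 10;
    nodes[7] = 13;
    nodes[2] = toy_hash(10, 11); // stored internal hash: no recomputation needed
    nodes[6] = 12;

    // Final pass: fill in only the still-uninitialized internal hashes,
    // bottom-up, from their children.
    for i in (1..4).rev() {
        if nodes[i] == UNINITIALIZED {
            nodes[i] = toy_hash(nodes[2 * i], nodes[2 * i + 1]);
        }
    }

    assert_eq!(nodes[3], toy_hash(12, 13));
    assert_eq!(nodes[1], toy_hash(nodes[2], nodes[3]));
    println!("root = {}", nodes[1]);
}
```

Being able to test `nodes[i] == UNINITIALIZED` is exactly the point of the sentinel: it distinguishes "not yet known" from every legitimate hash without a separate Option discriminant per node.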

Happy to chat more about this sync or async :)

cronokirby commented 8 months ago

These are great points. Another great point raised by @plaidfinch in a discussion outside of Github was that this sentinel value can't be deserialized from bytes with a method that checks that the serialization of a field element is canonical. This safeguard against accidentally getting the sentinel value when parsing wouldn't be present with a random field element.
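That safeguard can be sketched as follows (a toy u64 "field" with a hypothetical parser; real decaf377 encodings are 32 bytes): a canonical-bytes parser rejects any encoding greater than or equal to the modulus, so the out-of-range sentinel can never be produced by deserialization, whereas a randomly chosen in-range element could be.

```rust
const Q: u64 = 251; // stand-in modulus

// Toy canonical parser: accepts only encodings strictly below the modulus.
fn from_canonical_bytes(bytes: [u8; 8]) -> Option<u64> {
    let n = u64::from_le_bytes(bytes);
    if n < Q { Some(n) } else { None } // non-canonical encodings are rejected
}

fn main() {
    let sentinel: u64 = u64::MAX; // like Fq(2^256 - 1)
    // The sentinel's encoding is non-canonical, so parsing can never yield it.
    assert_eq!(from_canonical_bytes(sentinel.to_le_bytes()), None);
    // An in-range element parses fine, which is why a random field element
    // would lack this safeguard.
    assert_eq!(from_canonical_bytes(42u64.to_le_bytes()), Some(42));
    println!("canonical parsing rejects the sentinel");
}
```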

cronokirby commented 8 months ago

A fun point about semver is that arkworks is 0.4, so in theory we could get rug pulled, but oh well 8^)

hdevalence commented 7 months ago

Where did we end up on this?

TalDerei commented 7 months ago

> Where did we end up on this?

https://github.com/penumbra-zone/decaf377/pull/88/commits/db0124858be1554f7aee7e85286524d44fa9300b

hdevalence commented 7 months ago

OK, great! Wondering: is this issue closed, or is there still work we need to do?

TalDerei commented 7 months ago

> OK, great! Wondering is this issue closed or is there still work we need to do?

This issue can be closed!