runtimeverification / kontrol


Abstracted storage representation for Solidity smart contracts #165

Open ehildenb opened 10 months ago

ehildenb commented 10 months ago

An idea from @geo2a about changing our storage representation to allow for more abstract reasoning at the level of Solidity data structures. Other tools (like Halmos) do this; it costs generality in terms of handling arbitrary EVM bytecode, but gains in speed/efficiency.

This will provide:

hjorthjort commented 8 months ago

A further discussion with @RaoulSchaffranek and @PetarMax today about how to deal properly and soundly with lookups into storage in the presence of arrays and mappings yielded the following ideas. They seem to be mutually beneficial and act at different levels of the stack.

Background

We anticipate, based on the STXN engagement, that we will at some point need to support importing storage and code from chain rather than building it all from scratch.

The Solidity compiler (since a couple of versions back) treats storage in an essentially "unsound" way, in that it assumes infinite storage is available. This is close enough to true, assuming that keccak is truly collision-resistant, since contract storage is $2^{256}$ words large.

Every storage read and update (SLOAD and SSTORE) that is generated by the Solidity compiler will either be to a specific storage slot that is precomputed (myUint and myUints below) or the result of addressing into an array (myArray) or mapping (myMapping). It is possible for a developer to write code that addresses storage without using these Solidity language-level constructs, but it would be unusual (and essentially malicious). We should not completely discount the possibility, just try to avoid reasoning about it unnecessarily.

contract MyContract {
  uint myUint;  // slot 0
  uint[5] myUints;  // slots 1-5
  uint64[] myArray;  // length in slot 6; elements start at slot keccak(6), can grow arbitrarily large
  uint[] myArray2;  // length in slot 7; elements start at slot keccak(7), can grow arbitrarily large
  mapping (address => uint256) myMapping;  // stored in all kinds of different slots, given by keccak(uint256(addr), 8); further information here: https://docs.soliditylang.org/en/v0.8.23/internals/layout_in_storage.html#mappings-and-dynamic-arrays
}
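For concreteness, these are the storage addresses the compiler derives from the layout above (pad32 is shorthand here, not Solidity syntax, for left-padding a value to 32 bytes):

myUints[i]       =>  slot 1 + i
myArray[i]       =>  slot keccak(6) + (i / 4)    // four uint64 values are packed per 32-byte slot
myArray2[i]      =>  slot keccak(7) + i
myMapping[addr]  =>  slot keccak(pad32(addr) ++ pad32(8))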

The prover/symbolic execution engine comes up against an impossible task whenever it needs to address into mappings or arrays. For example, given a symbolic index a to look up in myArray2, it will look at the current storage and, for each key K, essentially ask: chop(keccak(7) + a) == K. Of course, this is absolutely satisfiable! It is just very unlikely to be true for any given K, and may not even be possible, since a will be constrained by the size of the array. Now, what if we ask chop(keccak(7) + a) == chop(keccak(6) + b)? An underlying assumption of the compiler is that this is always false, but it is not mathematically necessary that it is, and a naive approach to addressing into storage in KEVM will simply check for equality between what is currently being addressed and every existing key in the storage, leading to absolute explosion.

The current approach

@PetarMax has had success by adding certain lemmas that are mathematically unsound but perfectly safe in virtually any Solidity code written in good faith and compiled with a Solidity version from the last couple of years. Here are the lemmas (paraphrased):

keccak(X) == Y => false [concrete(Y), symbolic(X), simplification]
keccak(X) == keccak(Y) => X == Y

The approach has been to simply log all instances of these rules applying, allowing for manual inspection, rather than let the prover branch on all possible comparisons between a symbolic keccak result and known storage slots (which may themselves be symbolic keccak results).

Leaving keccak uncomputed

A first approach could be to simply not compute keccaks in the symbolic backend. This increases legibility. If there are pre-loaded values in storage (say, from downloading an actual on-chain contract's storage and loading that), we could still allow computing keccak results when fully concrete, only for comparison:

keccak(X) == Y => true requires keccak_actually_compute_it(X) == Y [concrete(X), concrete(Y), simplification]
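As a sketch (untested, and reusing the #parseHexWord and Keccak256bytes helpers from the [keccak] rule quoted at the end of this thread), such a rule could compute the hash only inside the comparison, leaving all other keccak(X) terms uncomputed:

rule keccak(X) ==Int Y => #parseHexWord(Keccak256bytes(X)) ==Int Y
  [simplification, concrete(X, Y)]

Unlike the => true formulation above, this form evaluates to the actual truth value, so it also discharges comparisons that turn out to be false.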

un-keccak

If we are loading on-chain data we would be well served by writing a tool that does its best to identify the keccak computations that generated those keys. This information is available historically: it may not be trivial to find, but once identified it is trivial to keep around. Invocations of Kontrol would then, at the relevant point in time, go over storage and code and convert any instances of the known integers to their keccak representations.

Again, this would increase legibility and allow us to use more intelligent introspection. For example, keccak((SOME_ADDRESS), 8) == keccak(7) + 3 would be more recognizable as false, and it might even be possible to discharge this equality with a suitable lemma.

assumption

We are currently mixing two concepts in a very unappealing way: we are conflating lemmas (which should be sound, provable, helpful simplifications) with domain-level simplifying assumptions that are certainly not always true. This creates friction and requires great care. It is also not very future-proof: one could imagine a more intelligent SMT solver -- perhaps with an artificial general intelligence coming up with wondrous mathematical proofs to aid it -- seeing that we have defined a function, keccak, with an infinite domain and a finite codomain:

keccak :: Byte[] -> Uint256

Then we assert that this function is injective (see the lemmas in "The current approach"):

keccak(X) == keccak(Y) => X == Y

It is trivial to apply the pigeonhole principle: a function cannot possibly be injective if its codomain is strictly smaller than its domain, and thus it must be the case that there is some collision. Indeed, all hash functions have collisions, even if they are collision resistant. Thus, an intelligent enough solver could deduce that every formula containing the assertion that keccak is injective is unsat, making the solver worthless simply by virtue of being 100% correct and more capable than z3 is today.
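Spelling out the counting step, restricting to inputs of one fixed length so that both sides are finite: there are $256^{33} = 2^{264}$ inputs of length 33 bytes but only $2^{256}$ possible outputs, so by pigeonhole two distinct 33-byte inputs must already share an output, and keccak cannot be injective.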

It seems inherently unwise to allow unsound lemmas, even if they are helpful in the short term. What we are looking for here is really a way to add simplifying assumptions in a schematic way. And rather than automatically commingling these assumptions with absolute truths, we should take them as a computational convenience: "this might well be a satisfiable condition, but we shall not waste computational resources on exploring it". So, for example, encountering keccak((SOME_ADDRESS), 8) == keccak(7) + 3, we should not choose to label it unsatisfiable, but rather mark it as ignored for now.

A sensible way to do this could be to introduce a new type of lemma; let's call it assumption. We would write:

keccak(X) ==K keccak(Y) => X ==Bytes Y [assumption]

This would mean that the backend would be free to apply this as a simplification, but report a branch: the result being one leaf with the assumption applied, and another marked as its counterfactual: #Not( keccak(X) == keccak(Y) => X == Y). The exact way this labeling happens is less important, because the function of the assumption is to stop evaluating: the leaf where the assumption does not hold is not marked as impossible; it is simply not computed further but is left in the KCFG. A flag in Kontrol could be used to indicate whether the counterfactual leaves should be ignored -- the proof goes through if it can be discharged on all branches on which the assumption holds. Other use cases, which are more about searching the KCFG for possible executions, would not bother with deciding whether a spec holds or not, and would still have access to these leaves in the KCFG for inspection.
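Concretely, for a node whose path condition PC already contains keccak(X) ==K keccak(Y), applying the assumption would produce two leaves (a sketch, in the same informal notation as above):

PC #And ( X ==Bytes Y )         // assumption applied; symbolic execution continues from here
PC #And #Not ( X ==Bytes Y )    // counterfactual: together with PC this asserts a genuine keccak collision; kept in the KCFG but not explored further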

This would make it easier for provers to add assumptions like these without fear of introducing dangerous unsoundness, and the resulting branches would be readily locatable in the KCFG.

ehildenb commented 8 months ago

This proposal looks great! I think it could work out quite nicely. Moving back to Triage to make sure the team discusses it (feel free to re-backlog it if needed).

ehildenb commented 8 months ago

Also, I think we'll find we don't need the assumption keyword, but maybe it will still be helpful.

palinatolmach commented 8 months ago

Thanks @ehildenb! We had a discussion yesterday and decided that we'll address the "Leaving keccak uncomputed" approach from the Kontrol side early next year, while @geo2a will look into the assumption section when he has time.

PetarMax commented 8 months ago

There is this rule in the EVM semantics:

rule [keccak]: keccak(WS) => #parseHexWord(Keccak256bytes(WS)) [concrete]

which, if disabled for symbolic execution, could leave the concrete keccaks visible as uncomputed terms. Hopefully, the compiler doesn't compute them on its own.
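One way to arrange that (a sketch only; the module structure and names are hypothetical, not the actual layout of the semantics) would be to move the rule into a module that only the concrete-execution definition imports, so that it never fires during symbolic execution:

module KECCAK-COMPUTE-CONCRETE
    imports SERIALIZATION   // hypothetical home of keccak and Keccak256bytes
    rule [keccak]: keccak(WS) => #parseHexWord(Keccak256bytes(WS)) [concrete]
endmodule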