runtimeverification / hs-backend-booster

Accelerates K Framework's Haskell backend
BSD 3-Clause "New" or "Revised" License

New symbolic rewriter in haskell backend #3

Closed ehildenb closed 1 year ago

ehildenb commented 1 year ago

We need a new symbolic rewriter to take care of "bulk" execution for semantics: execution that is not interesting and is just shuffling around state and checking simple conditions.

As an experiment, branch _ehildenb of KEVM implements a new rewriter with a much simpler rewriting algorithm that can handle most of KEVM execution (>95% of K execution steps, 6x faster than the Haskell backend). We should implement the same in Haskell and put it in front of the current backend. The algorithm is here in Python. It needs some adjustments to be sound (which are included in this document). The current version on that branch is much slower because it uses the textual Kore interface to communicate with the LLVM backend on every rewrite step (6-8s round-trip time), even though the actual time spent in the LLVM backend is only 40-100ms per query.

The point of this algorithm is not to be clever or good at difficult symbolic execution; it is to be super fast at easy symbolic execution. Because we already have a general symbolic execution engine, we can always fall back to it. So this algorithm is designed to be as fast as possible on the "happy path" (the >95% case), while detecting when it is not sufficient so it can gracefully fall back to the original path for the harder steps. One key observation is that we never do simplification that is not needed. In particular, we avoid ever having to do the following (delegating it to the more powerful engine when necessary, but not every step):

However, branching is also expensive. Each branch adds another path of execution we have to explore. For now, we will say that every time this simple rewriter cannot prune a branch point, we will ask the original simplifier to prune it for us. If the original simplifier cannot prune one of the branches, then it is a proper branch point.

Then we can study the cases where the backend has to fall back to the original rewriter, and make small generalizations to our rewriter that let it make progress in slightly more cases. The algorithm here should be able to make it through roughly an opcode's worth of execution in KEVM without falling back to the original rewriter, maybe more.
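To make the fallback contract concrete, here is a minimal Haskell sketch of the result type such a rewriter might expose; all names here are hypothetical, not the actual hs-backend-booster API:

```haskell
-- Hypothetical interface: each step either makes progress on the happy
-- path or says precisely why it must defer to the general engine.
data StepResult term
  = Rewritten term            -- exactly one rule applied
  | Branched [term]           -- several rules survived the cheap checks;
                              -- ask the original simplifier to prune
  | Aborted AbortReason term  -- fall back to the original path entirely

data AbortReason
  = NoRuleIndexed       -- the rule index returned no candidates
  | NeedsACUnification  -- matching would require AC/collection reasoning
  | UndecidedCondition  -- a side-condition did not reduce to true/false
  deriving Show
```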

Later steps to get more aggressive performance improvements (note that these were not needed for the pyk engine's performance, except for K cell simplification):

ehildenb commented 1 year ago

What is "bulk execution"?

Here is an example execution using the pyk rewriter:

[screenshot]

It goes back and forth between the pyk rewriter and the haskell backend: 62 steps with the pyk rewriter, then 1 step with the haskell backend, then 44 steps with the pyk rewriter, then 1 step with the haskell backend, etc.
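A minimal sketch of the driver loop this alternation implies, assuming a fast-step function that reports failure and a general engine that always makes one full step (both interfaces are assumptions, not existing functions):

```haskell
-- Take cheap steps while possible; on any abort, take a single step
-- with the general engine and resume with the fast rewriter.
run :: Monad m
    => (term -> m (Either abort term))  -- fast rewriter: Right = progress
    -> (term -> m term)                 -- general engine: one full step
    -> Int                              -- remaining depth
    -> term
    -> m term
run fastStep generalStep = go
  where
    go 0 t = pure t
    go n t = do
      r <- fastStep t
      case r of
        Right t' -> go (n - 1) t'                -- happy path
        Left _   -> generalStep t >>= go (n - 1) -- graceful fallback
```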

ehildenb commented 1 year ago

The effect of rule indexing

Here is an example where the rule indexer chosen for KEVM does really well, giving back a single rule that could apply (<3ms to pick a rule and apply it):

[screenshot]

And an example where it does less well, giving back 88 rules that could apply (~70ms to pick a rule and apply it):

[screenshot]

And here is a rule that the index fails on entirely, giving back * (~300ms):

[screenshot]

Rule indexing has a massive impact on performance. The key is an index function that is very fast to compute but also drastically narrows down the set of rules that could apply.

All three rules shown here apply on every opcode execution cycle, but take drastically different amounts of time. Both of the bad indexes (the ones with 88 rules and 360 rules) could be improved by inspecting a slightly different subterm than "the first constructor at the top of the K cell". In the #gasExec(_,_) case, you could inspect the constructor in the second argument position of #gasExec(_,_); in the #deductGas case, you could inspect the constructor at the top of the second item of the KSequence. Ideally we would have a generic way of constructing rule indexes that is also somehow optimal.
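As a rough illustration of that suggestion, here is a hedged Haskell sketch of an index function that normally reads the first constructor at the top of the K cell, but descends one level deeper for the two bad cases. Term, Index, and the special handling of #gasExec/#deductGas are illustrative, not the real implementation:

```haskell
import qualified Data.Map.Strict as Map

type Symbol = String

-- A toy term representation: applications and K sequences.
data Term = App Symbol [Term] | KSeq Term Term | Var String

data Index = TopSymbol Symbol | Anything  -- Anything plays the role of "*"
  deriving (Eq, Ord, Show)

-- First constructor at the top of the K cell, except for symbols we
-- know need a deeper look (the 88-rule and 360-rule cases above).
termIndex :: Term -> Index
termIndex (KSeq (App "#gasExec" [_, arg]) _) = termIndex arg  -- 2nd argument
termIndex (KSeq (App "#deductGas" _) rest)   = termIndex rest -- 2nd K-cell item
termIndex (KSeq t _)                         = termIndex t
termIndex (App sym _)                        = TopSymbol sym
termIndex (Var _)                            = Anything

-- Candidate rules: the term's own bucket, plus rules indexed Anything,
-- which must be tried against every configuration.
candidates :: Map.Map Index [rule] -> Term -> [rule]
candidates table t = case termIndex t of
  Anything -> concat (Map.elems table)  -- the index gives no information
  idx      -> Map.findWithDefault [] idx table
                ++ Map.findWithDefault [] Anything table
```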

ehildenb commented 1 year ago

Constraint Solving

At each step of execution, you may generate new constraints in the form of requires and ensures clauses. These constraints are usually boolean predicates over components of your state. When rules apply, the constraints are instantiated and must be checked to see whether they reduce to false. The more states we can send to false, the fewer branches we need to explore. But constraint solving can get very involved and expensive quickly; for bulk execution we want heuristics that are fast and succeed frequently.

Here is an example where only one rule ends up applying (even though the rule indexing isn't very good), and no constraint solving is needed (of the first 60 steps, 34 are like this, needing no constraint solving):

[screenshot]

Here is an example where only 3 rules are in the index, but each has a side-condition for which we must invoke the LLVM backend (we make 99 calls to the LLVM backend in the first 62 steps, which can probably be reduced somewhat):

[screenshot]

Each invocation of the LLVM backend takes ~110ms and accounts for about 10s of the 17s of execution for the first 62 steps. For each query, ~60ms is spent converting between KAst <-> Kore data structures in Python and loading/unloading the LLVM backend. Here is what a call to the LLVM backend looks like in detail:

[screenshot]

Between the first and second lines, 17ms elapses; this is where the KAst -> Kore conversion happens (among other things). From there we enter the krun Bash script, which makes several quick no-op calls, spends 40ms actually computing in llvm-krun, and makes a few more no-op calls. The overall call to the krun Bash script takes 108ms. Between the "Completed" and "Simplified" lines towards the end, the Kore -> KAst conversion happens, taking ~14ms. Overall the process takes 137ms.

I hope we can get this process down to 60ms, and roughly constant time, with FFI to the LLVM backend. Calling the constraint solver takes much longer than rule matching, for example. There is a hard limit on how fast the backend can ever be with this approach, because the LLVM backend seems to take ~40ms no matter what. In addition, many of these constraints can be solved very quickly with a single rewrite rule, which can likely be applied on the order of ~1ms (not all constraints, though; some trigger massive data-structure manipulations).

Having FFI to the LLVM backend means we can have a simplification routine with fast performance, a simple design, and not much integration work. One strategy we could adopt later is this (for example):

The theory is that user-supplied simplification rules should reduce to true/false very quickly (a few rule applications), and if they do not, it's because we are doing some larger data-structure manipulation that the LLVM backend can handle better.
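A minimal sketch of that layered strategy, assuming a small step budget for cheap simplification rules before deferring to the LLVM backend (the budget of 5 and all names are made up for illustration):

```haskell
-- Try cheap rewrite-rule simplification a few times; if the predicate
-- is still undecided, hand it to the LLVM backend oracle (assumed FFI).
simplifyPredicate
  :: Monad m
  => (pred -> Maybe pred)   -- one simplification-rule application, if any
  -> (pred -> Maybe Bool)   -- is the predicate literally true/false yet?
  -> (pred -> m Bool)       -- LLVM backend oracle (assumed FFI call)
  -> pred -> m Bool
simplifyPredicate ruleStep asBool llvmEval = go (5 :: Int)
  where
    go n p
      | Just b <- asBool p           = pure b      -- decided cheaply
      | n > 0, Just p' <- ruleStep p = go (n - 1) p'
      | otherwise                    = llvmEval p  -- budget spent or stuck
```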

ehildenb commented 1 year ago

Improvements to rule indexing

I improved the rule indexing as described at the end of the rule-indexing comment above, and it shaved 3s off the overall time. The Python rewriter now spends ~14s to do 66 K steps, with ~10s of that in queries to the LLVM backend. For reference, the current haskell backend takes ~24s for the same rewrite steps.

ana-pantilie commented 1 year ago

Glossary:

  1. LLVM FFI => might be easier to do this for a PoC, instead of writing a concrete simplifier right now
     Requirements:
    • use unboxed, basic types instead of any kind of higher-level maps or sets
     Other uses:
    • potential FFI between Python and Haskell, which could replace the RPC server in the future
  2. We need to re-design the basic types (TermLike, Internal.Pattern), keeping in mind that we want fast FFI (see the sketch after this list):
    • not requiring term traversals to figure out whether terms are concrete
    • clean separation between configuration and constraints in patterns
  3. Instead of haskell-kompile, start with a load procedure which does definition initialization and pre-processing (rule indexing, preserves-definedness checks, etc.); this can be moved later on into a separate haskell-kompile binary called by the frontend. This way we can focus our efforts on quickly having a functional new execution engine, without the architectural distractions.
  4. Rule indexing:
    • implement Everett's suggestion of a simple rule-indexing algorithm
    • architect this in such a way that it's easy to extend or replace it with something else later
  5. Unification:
    • implement Everett's suggestion of a simple unification algorithm
    • architect this in such a way that it's easy to extend or replace it with something else later
  6. New rewriting algorithm:
    • "If there are multiple remaining states (branching), call the original haskell backend simplification on the states to see if any go to #Bottom": is this enough?
     Requirements:
    • integrate the old Haskell backend into the new Haskell backend (how?); should we start completely decoupled, using the RPC server to interact with the old Haskell backend?
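For point 2, here is a hedged sketch of what "not requiring term traversals to figure out whether terms are concrete" could look like: attributes cached at construction time behind smart constructors, so the concreteness query is O(1). These types are purely illustrative; the real design is being worked out in the new-datatypes issue linked below.

```haskell
-- Illustrative only: cache attributes at construction so that
-- "is this term concrete?" never requires a traversal.
type Symbol = String

newtype Attributes = Attributes
  { isConcrete :: Bool }  -- room for more cached flags later

data Term
  = App !Attributes Symbol [Term]
  | Var !Attributes String

attrs :: Term -> Attributes
attrs (App a _ _) = a
attrs (Var a _)   = a

-- Smart constructors keep the cached flag correct by construction.
mkApp :: Symbol -> [Term] -> Term
mkApp sym args = App (Attributes (all (isConcrete . attrs) args)) sym args

mkVar :: String -> Term
mkVar = Var (Attributes False)  -- variables are never concrete
```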

Conclusions:

ehildenb commented 1 year ago

Some responses:

Also, I want to make some notes about the "philosophy" here. The entire idea behind having a fast symbolic transition system (symbolic execution) is that you need to figure out which transitions could apply as fast as possible, and to eliminate potential transitions as fast as possible. There are many reasons a transition might not apply; this algorithm emphasizes doing the cheapest and most frequently succeeding checks early. In particular, you could have an algorithm that instead did:

This would be closer to what the current backend does, which is: "finish all algorithms to completion".

But it is often the case that, for example, AC unification is not required because one of the side-conditions in the 3rd step goes trivially to false with a single rewrite rule. So the above algorithm re-arranges the state-pruning checks a bit:

This is a spectrum we can tune; we have knobs we can control. If we put the new rewrite engine in front of the current one and start with only the rule indexing and the simple fast unification, the statistics above show that this alone would be enough for roughly 50% of the rewrite steps we take! So it's easy to build this incrementally on top of the current engine, because we can always fall back to the current engine for (i) state pruning, and (ii) a single step of symbolic execution.
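A small sketch of the short-circuiting this re-ordering buys, assuming a three-valued verdict per check; thanks to laziness, the expensive checks at the tail of the list are never forced once a cheap check prunes the rule (the check names in the comment are hypothetical):

```haskell
-- Three-valued outcome of a single pruning check.
data Verdict = RuledOut | Unclear | Passes
  deriving (Eq, Ord, Show)

-- Combine checks listed cheapest-first; any RuledOut prunes the rule
-- immediately and, by laziness, the later checks are never evaluated.
prune :: [Verdict] -> Verdict
prune = foldr step Passes
  where
    step RuledOut _ = RuledOut  -- decisive: stop here
    step v        r = min v r   -- Unclear taints a Passes result

-- Intended ordering for a rule/configuration pair (hypothetical names):
--   [ indexCheck rule cfg          -- ~free: compare rule indexes
--   , syntacticMatch rule cfg      -- cheap: no AC reasoning
--   , quickSideCondition rule cfg  -- a rewrite rule or two
--   , fullUnifyAndSolve rule cfg ] -- expensive: old backend / solver
```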

This is why I'd argue for tight integration with the current backend. We can lean on it and incrementally make the new rewriter handle more cases (as profiling shows it's even necessary). If we go and write our own from scratch, we will need to handle 100% of the rules of KEVM before we see any performance benefit.

ehildenb commented 1 year ago

Here are the requirements:

I've gone through the implementation steps above again, and made the following adjustments:

Some of the tasks are parallelizable. We need to figure out how to assign and implement them so we achieve this goal by the holiday closure.

ana-pantilie commented 1 year ago

Questions:

  1. New repository
  2. New calls old:
    • spinning up the old RPC server and using it to call into the old backend (cons: it's risky, because we're not currently exercising this codepath when proving)
    • use the old Haskell backend as a library (cons: keep the initialized definition on the side as state, pulling in the dependencies of the old codebase directly) (pros: packaging will be easier)
    • this means reimplementing kore-exec
  3. Old calls new:
    • the new prover would have to have its own state, and the old one would have to keep track of it, so more changes to the old codebase
    • we've already spent a lot of time packaging the old codebase nicely
    • ~it will have to make the new one public if we use the new one as a library~
    • we could use the rpc protocol to call the new one, this has some overhead because it needs setting up
    • will the new codebase have to support branching?
  4. kore-exec --prove is implemented in pyk:
    • new backend implements execute endpoint
    • pyk handles the logic which calls the new backend and falls back to the old one

Observations:

We can work on the following right now (without deciding on how we integrate with the old backend), in order:

  1. design new datatypes (https://github.com/runtimeverification/hs-backend-booster/issues/4)
  2. port the kore parser to our new datatypes
  3. design effect stack
  4. after 2., we can start on the following (in no particular order):
    • load procedure (which does rule indexing, preserves-definedness)
    • the simple unifier (a sketch follows this list)
    • FFI
  5. put it all together and write tests!
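As a sketch of what the simple unifier could look like (a generic fail-fast first-order matcher, not necessarily Everett's actual algorithm): syntactic matching that returns Unclear instead of attempting AC/collection unification, so the caller can fall back to the old backend.

```haskell
import qualified Data.Map.Strict as Map

data Term = App String [Term] | Var String
  deriving (Eq, Show)

type Subst = Map.Map String Term

data MatchResult = Match Subst | NoMatch | Unclear
  deriving Show

-- Match a rule LHS (pattern) against a configuration (subject),
-- threading a substitution left to right.
match :: Term -> Term -> MatchResult
match pat subj = go pat subj Map.empty
  where
    go (Var x) t s = case Map.lookup x s of
      Nothing           -> Match (Map.insert x t s)  -- bind fresh variable
      Just t' | t' == t -> Match s                   -- consistent re-use
      _                 -> NoMatch
    go (App f ps) (App g ts) s
      | f `elem` acSymbols     = Unclear  -- defer AC matching to old backend
      | f /= g                 = NoMatch  -- different constructors: prune
      | length ps /= length ts = NoMatch
      | otherwise              = goList ps ts s
    go (App _ _) (Var _) _ = Unclear      -- subject not concrete enough

    goList [] [] s         = Match s
    goList (p:ps) (t:ts) s = case go p t s of
      Match s' -> goList ps ts s'
      other    -> other
    goList _ _ _           = NoMatch

    -- Hypothetical names for AC (map/set) symbols in the definition.
    acSymbols = ["_Map_", "_Set_"]
```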

Communicating via RPC: easier to isolate issues/behaviors and debug

Long-term:

Either way we'd want an RPC server for the new backend.

execute, simplify and check-implication need to be separate!

The new backend's execute should support cut patterns. We can use the simple unifier here, because the RHS of any realistic claim will have some free constructor as its rule index (marking it as a terminal state). This way, we can avoid having to implement check-implication now.

ehildenb commented 1 year ago

What do we need an effect stack for? Can we try to avoid getting into the polymorphic/monomorphic mess that we got into with the prior backend? The monad stacks seemed to be a source of those problems.

ana-pantilie commented 1 year ago

What do we need an effect stack for?

That's just the way you write meaningful Haskell programs. We can avoid the polymorphic/monomorphic mess by not using MTL.

ana-pantilie commented 1 year ago

Make a JSON-RPC server with a single endpoint {"command": "execute", "cut-patterns": { CTerms }, "depth": -1}, which matches the existing execute endpoint but can return an additional failure state.
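For illustration, the request above could be modelled with aeson roughly as follows; the field names mirror the JSON shown, while everything else (in particular the representation of cut patterns as textual Kore) is an assumption:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson
import Data.Aeson.Types (camelTo2)
import GHC.Generics (Generic)

-- Assumed shape of the execute request; cut-pattern representation TBD.
data ExecuteRequest = ExecuteRequest
  { command     :: String    -- always "execute"
  , cutPatterns :: [String]  -- textual Kore patterns (an assumption)
  , depth       :: Int       -- -1 = unbounded
  } deriving (Show, Generic)

-- Map Haskell field names to the hyphenated JSON keys ("cut-patterns").
jsonOptions :: Options
jsonOptions = defaultOptions { fieldLabelModifier = camelTo2 '-' }

instance FromJSON ExecuteRequest where
  parseJSON = genericParseJSON jsonOptions

instance ToJSON ExecuteRequest where
  toJSON = genericToJSON jsonOptions
```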

@ehildenb it would be much easier for us to support cut-rules instead of cut-patterns (like the old backend does). Do you think cut-rules would suffice (at least for this initial implementation)?

ehildenb commented 1 year ago

I think we should probably strive to support cut-patterns instead of cut-rules, because I think that will be more useful long-term, but cut-rules also works fine for now.

Note that the cut-patterns we intend to use are all indexable (with the rule index), so the vast majority of the time, all you'll need to do to make sure the cut pattern doesn't match is check that it has a different rule index from the current term. This will be the case for KEVM and IMP at least; I don't know about other languages (though I think it's likely).
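That check is cheap to state; here is a minimal sketch, assuming a termIndex function like the one sketched earlier in this thread:

```haskell
-- The index shortcut for cut patterns: if the indexes differ, the cut
-- pattern cannot match the current term. `termIndex` is assumed.
mayMatchCut :: Eq idx => (term -> idx) -> term -> term -> Bool
mayMatchCut termIndex cutPattern current =
  termIndex cutPattern == termIndex current
-- False rules the cut out for free; True only means "not ruled out":
-- a real match still needs the simple unifier.
```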

I don't think it makes a big difference for V1, but if you guys think it will be easier, let's do cut-rules.

ana-pantilie commented 1 year ago

How does the frontend translate X ~> foo(Y) ...

ehildenb commented 1 year ago

Kast: KSequence([KVariable('X'), KApply('foo', [KVariable('Y')]), KVariable('_Gen0')])

Which in Kore: kseq(EVar('VarX', SortApp('SortKItem')), kseq(App('foo', [], [EVar('VarY', SortApp('SortBar'))]), EVar("Var'Unds'Gen0", SortApp('SortK'))))

ehildenb commented 1 year ago

See: https://github.com/runtimeverification/pyk/blob/master/src/pyk/ktool/kprint.py#L364 (Kast -> Kore) and https://github.com/runtimeverification/pyk/blob/master/src/pyk/ktool/kprint.py#L286 (Kore -> Kast).