Closed ehildenb closed 1 year ago
Here is an example execution using the pyk rewriter:
It goes back and forth between using the pyk rewriter and the haskell backend. So it takes 62 steps with the pyk rewriter, then 1 step with the haskell backend, then 44 steps with the pyk rewriter, then 1 step with the haskell backend, etc...
Here is an example where the rule indexer chosen for KEVM does really well, and gives back one rule which could apply (< 3ms to pick a rule and apply it):
And an example where it does less well, giving back 88 rules which could apply (~70ms to pick a rule and apply it):
And here is a rule that the index fails on, giving back `*` (~300ms):
Rule indexing has a massive impact on performance. The key is to have an index function that is very fast to compute, but also narrows down the possible rules that could apply drastically.
All three rules shown here apply every opcode execution cycle, but take drastically different amounts of time. Both of the bad indexes (the ones with 88 rules and 360 rules) could be improved by inspecting a slightly different subterm than "first constructor at the top of the K cell". In the case of the `#gasExec(_,_)` operator, you could inspect the constructor in the second argument position of the `#gasExec(_,_)` constructor, and in the case of `#deductGas`, you could inspect the constructor at the top of the second item of the KSequence. Ideally we would have a generic way of making rule indexes that was somehow optimal as well.
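To illustrate the kind of refinement being suggested, here is a toy sketch (not the pyk implementation; the term representation and the `#gasExec` special case are hypothetical stand-ins):

```python
# Toy sketch of "first constructor at the top of the K cell" rule indexing,
# plus the refinement suggested above for #gasExec: index on the constructor
# in its second argument as well. The term representation is hypothetical.

TOP = "*"  # wildcard index: rules indexed here must always be tried

def rule_index(k_cell):
    """`k_cell` is the K cell as a list (a KSequence); each item is either a
    (constructor_name, args) tuple for an application, or a variable name."""
    if not k_cell:
        return TOP
    head = k_cell[0]
    if isinstance(head, str):        # a variable at the top: no information
        return TOP
    ctor, args = head
    if ctor == "#gasExec":           # refined index: also look at argument 2
        op = args[1]
        return (ctor, op[0] if isinstance(op, tuple) else TOP)
    return ctor

def rules_to_try(index_table, k_cell):
    # The matching bucket, plus rules indexed under `*` (always candidates).
    return index_table.get(rule_index(k_cell), []) + index_table.get(TOP, [])

# <k> #gasExec(SCHED, SSTORE()) ~> REST </k>
k_cell = [("#gasExec", ["SCHED", ("SSTORE", [])]), "REST"]
print(rule_index(k_cell))            # ('#gasExec', 'SSTORE')
```

The refined index costs one extra constructor inspection but splits the single large `#gasExec` bucket into one bucket per opcode.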
Each step of execution may generate new constraints in the form of `requires` and `ensures` clauses. These constraints are usually boolean predicates about components of your state. When rules apply, the constraints are instantiated and need to be checked to see whether they go to `false`. The more states we can send to `false`, the fewer branches we need to explore. But constraint solving can also get very involved and expensive quickly; for bulk execution we want heuristics that succeed frequently but are fast.
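To make the "fast heuristics" point concrete, here is a toy sketch of instantiating a side-condition and pruning on `false`. The predicate representation and helper names are hypothetical, and the cheap simplifier only decides fully-concrete comparisons, leaving everything else to the expensive machinery:

```python
# Toy sketch of instantiating a rule's side-condition and pruning on `false`.
# Predicates are nested (op, args...) tuples; leaves are variable names or
# concrete values. All names here are hypothetical.

def substitute(pred, subst):
    """Apply a substitution to a predicate."""
    if isinstance(pred, tuple):
        op, args = pred[0], pred[1:]
        return (op,) + tuple(substitute(a, subst) for a in args)
    return subst.get(pred, pred)

def simplify_bool(pred):
    """Cheap heuristic evaluator: only decide fully-concrete comparisons,
    leave anything symbolic untouched (never call the solver here)."""
    if isinstance(pred, tuple) and pred[0] == "<" \
            and all(isinstance(a, int) for a in pred[1:]):
        return pred[1] < pred[2]
    return pred  # symbolic: undecided, keep the branch

# Side-condition `GAVAIL < GUSED` instantiated by rule matching:
cond = ("<", "GAVAIL", "GUSED")
inst = substitute(cond, {"GAVAIL": 100, "GUSED": 3})
print(simplify_bool(inst))  # False -> this successor state is pruned
```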
Here is an example where only one rule ends up applying (the rule indexing is very good here), and no constraint solving is needed (of the first 60 steps, 34 of them are this case, where no constraint solving is needed):
Here is an example where only 3 rules are in the index, but each has a side-condition that we must invoke the LLVM backend for (we make 99 calls to LLVM backend in the first 62 steps, which can probably be reduced somewhat):
Each invocation of the LLVM backend takes ~110ms, and accounts for about ~10s of the 17s of execution for the first 62 steps of this execution. For each query, ~60ms of time is spent converting between KAst <-> Kore data structures in Python and loading/unloading the LLVM backend. Here is what a call to the LLVM backend looks like in detail:
Between the first and second line, 17ms elapses; this is where the KAst -> Kore conversion happens (among other things). From there, we enter the `krun` Bash script, which makes several quick noop calls, spends 40ms in `llvm-krun` actually computing, and makes some more noop calls. The overall call to the `krun` bash script took 108ms. Between the lines "Completed" and "Simplified" towards the end, the Kore -> KAst conversion happens, which takes ~14ms. Overall the process takes 137ms.
I hope we can get this process down to 60ms, and have it be pretty much constant time, with FFI to the LLVM backend. It takes much longer to call the constraint solver than to do rule matching, for example. There is a hard limit on how fast the backend can ever be with this approach, because it seems that the llvm backend takes ~40ms no matter what. In addition, many of these constraints can be solved very quickly with a single rewrite rule, which can likely be applied on the order of ~1ms (not all constraints though; some trigger massive data-structure manipulations).
Having the LLVM backend FFI means we can have a simplification routine with fast performance, simple design, and not much work to integrate it. One strategy we could adopt later is this (for example):
The theory being that user-supplied simplification rules should be reducing to `true`/`false` very quickly (a few rule applications), and if they are not, it's because we are doing some larger data-structure manipulation that the LLVM backend can handle better.
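A minimal sketch of that staged strategy: try a few cheap rewrite steps first, and fall back to the (FFI'd) LLVM backend only if the predicate has not reduced to a truth value quickly. The function names and the step budget are hypothetical:

```python
# Sketch of the staged simplification strategy described above. Cheap
# user-supplied simplification rules get a small step budget; anything that
# survives is handed off to the LLVM backend. All names are hypothetical.

MAX_CHEAP_STEPS = 4  # user-supplied simplification rules should finish fast

def simplify_predicate(pred, apply_simplification_rule, llvm_simplify):
    for _ in range(MAX_CHEAP_STEPS):
        if pred in (True, False):
            return pred
        new_pred = apply_simplification_rule(pred)
        if new_pred == pred:        # no cheap rule applies: stop early
            break
        pred = new_pred
    if pred in (True, False):
        return pred
    # Larger data-structure manipulation, etc.: hand off to the LLVM backend.
    return llvm_simplify(pred)
```

A call site would pass the definition's simplification-rule applier and the backend query function; the budget is the knob that bounds time spent before falling back.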
I improved the rule indexing as this comment describes at the end, and it shaved 3s off the overall time. So now the Python rewriter is spending ~14s to do 66 K steps, with ~10s of that spent in queries to the LLVM backend. For reference, the current haskell backend takes ~24s for the same rewrite steps.
Glossary:

Instead of a separate `haskell-kompile`, start with a `load` procedure which does definition initialization and pre-processing (rule indexing, preserves-definedness checks, etc.). This can be moved later on into a separate `haskell-kompile` binary which will be called by the frontend. This way we can focus our efforts on having a functional new execution engine quickly, without the architectural distractions.

Conclusions:
Some responses:
`#computeValidJumpDests` in KEVM, which is always fully concrete and always a massive data-structure manipulation (no reason to do this with rewriting).

Also, I want to make some notes about the "philosophy" here. The entire idea behind having a fast symbolic transition system (symbolic execution) is that you need to figure out which transitions could apply as fast as possible, and you need to eliminate potential transitions as fast as possible. There are many reasons a transition could not apply; this algorithm emphasizes doing the cheapest and most frequently succeeding checks early. In particular, you could have an algorithm that instead did:
This would be closer to what the current backend does, which is: "finish all algorithms to completion".
But it is often the case that, for example, AC unification is not required because one of the side-conditions in the 3rd step goes trivially to false with a single rewrite rule. So the above algorithm attempts to re-arrange the state pruning checks a bit:
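The re-arranged pipeline can be sketched as an ordered list of checks, cheapest first, so a rule is usually rejected before the expensive machinery ever runs. The check implementations below are stubs standing in for the real components:

```python
# Sketch of re-arranging the state-pruning checks: run the cheapest and most
# frequently succeeding checks first. The costlier checks only run when every
# cheaper check has passed. The check implementations are stubs.

def simple_unify(rule, state):        # stub: fast syntactic unifier
    return True

def side_conditions_ok(rule, state):  # stub: would query the LLVM backend
    return True

CHECKS = [                            # ordered cheap -> expensive (the knob)
    lambda r, s: r["index"] in ("*", s["index"]),  # ~us: rule-index lookup
    simple_unify,                                  # ~ms: syntactic unification
    side_conditions_ok,                            # ~100ms: backend query
]

def could_apply(rule, state):
    return all(check(rule, state) for check in CHECKS)  # short-circuits

print(could_apply({"index": "foo"}, {"index": "bar"}))  # False (pruned cheaply)
```

Reordering `CHECKS` is exactly the "spectrum we can tune" described below: moving a check earlier pays off when it is cheap and rejects often.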
This is a spectrum we can tune; we have knobs we can control. So if we make the new rewrite engine in front of the current one, and we only do the rule indexing and the simple fast unification first, the above statistics show that this would be enough for roughly 50% of the rewrite steps we take anyway! So it's easy to build this incrementally when we build it on top of the current engine, because we can always fall back to the current engine for (i) state pruning, and (ii) a single step of symbolic execution.
This is the reason I'd argue for tight integration with the current backend. We can lean on it and make it incrementally handle more cases (as we profile and check that it's even necessary). If we go and write our own, we will need to handle 100% of the rules of KEVM before we can benefit from performance improvements.
Here are the requirements:
I've gone through the implementation steps above again, and made the following adjustments:
Some of the tasks are parallelizable. We need to figure out how to assign and implement them to achieve this goal by Holiday closure.
Questions:
- `kore-exec --prove`
- `kore-exec`
- `kore-exec --prove` is implemented in `pyk`:
Observations:
We can work on the following right now (without deciding on how we integrate with the old backend), in order:

- `load` procedure (which does rule indexing, preserves-definedness)

Communicating via RPC: easier to isolate issues/behaviors and debug.

Long-term: either way we'd want an RPC server for the new backend.
`execute`, `simplify`, and `check-implication` need to be separate!
New backend execute should support cut patterns. We can use the simple unifier here, because the RHS of any realistic claims will have some free constructor that is its rule index (marks that this is a terminal state). This way, we can avoid having to implement check-implication now.
`kore-exec --prove`. Let's not worry about RPC for the moment.

What do we need an effect stack for? Can we try to avoid getting into the polymorphic/monomorphic mess that we got into for the prior backend? The monad stacks seemed to be a source of those problems.
> What do we need an effect stack for?

That's just the way you write meaningful Haskell programs. We can avoid the polymorphic/monomorphic mess by not using MTL.
Make a JSON-RPC server with a single endpoint `{"command": "execute", "cut-patterns": { CTerms }, "depth": -1}`, which matches the existing `execute` endpoint, but has an additional failure state that can be returned.
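As a concrete example of that request shape (the `cut-patterns` value is shown here as a list of pattern strings purely for illustration; the actual CTerm encoding is not specified in this comment):

```python
import json

# The single-endpoint request shape proposed above. The cut-pattern string
# is hypothetical; "depth": -1 means "no depth bound".
request = {
    "command": "execute",
    "cut-patterns": ["<k> #halt ~> _ </k>"],  # hypothetical pattern
    "depth": -1,
}
payload = json.dumps(request)
print(payload)
```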
@ehildenb it would be much easier for us to support `cut-rules` instead of `cut-patterns` (like the old backend does). Do you think `cut-rules` would suffice (at least for this initial implementation)?
I think we should probably strive to support cut-patterns instead of cut-rules, because I think that will be more useful longterm, but cut-rules also works fine for now.
Note that the cut-patterns we intend to use are all indexable (with the rule-index), so the vast majority of times all you'll need to do to make sure that the cut pattern doesn't match is to make sure it has a different rule index from the current term. For KEVM and IMP at least this will be the case, I don't know about other languages (I think it's likely though).
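The cheap check described here can be sketched in a few lines; `full_match` stands in for the (rare) actual matching work:

```python
# Sketch of the observation above: if the cut pattern's rule index differs
# from the current term's index, the cut pattern cannot match, so no actual
# matching work is needed. `full_match` stands in for real matching.

def hits_cut_pattern(term_index, cut_index, full_match):
    if "*" not in (term_index, cut_index) and term_index != cut_index:
        return False        # cheap reject: different rule-index classes
    return full_match()     # rare case: fall back to actual matching

print(hits_cut_pattern("foo", "bar", lambda: True))  # False, no matching done
```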
I don't think it makes a big difference for V1, but if you guys think it will be easier, let's do `cut-rules`.
How does the frontend translate between KAst and Kore?

KAst: `KSequence([KVariable('X'), KApply('foo', [KVariable('Y')]), KVariable('_Gen0')])`

Which in Kore is: `kseq(EVar('VarX', SortApp('SortKItem')), kseq(App('foo', [], [EVar('VarY', SortApp('SortBar'))]), EVar("Var'Unds'Gen0", SortApp('K'))))`
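As a toy illustration of the nesting involved (not the real pyk conversion; escaping and sorts are simplified and hard-coded), the KSequence above can be printed as right-nested `kseq` applications:

```python
# Toy sketch of the translation shown above: a KSequence is printed as
# right-nested kseq(_, _) applications. Variable names get a "Var" prefix and
# "_" is escaped as "'Unds'". Not the real pyk conversion.

def kore_var(name, sort):
    escaped = name.replace("_", "'Unds'")
    return f"EVar(\"Var{escaped}\", SortApp('{sort}'))"

def kore_kseq(items):
    # items: already-converted Kore strings; the last is the tail of sort K
    result = items[-1]
    for item in reversed(items[:-1]):
        result = f"kseq({item}, {result})"
    return result

foo_y = f"App('foo', [], [{kore_var('Y', 'SortBar')}])"
print(kore_kseq([kore_var('X', 'SortKItem'), foo_y, kore_var('_Gen0', 'K')]))
```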
We need a new symbolic rewriter to take care of "bulk" execution for semantics: execution that is not interesting and is just shuffling around state and checking simple conditions.
As an experiment, on branch `_ehildenb` of KEVM, there is a new rewriter implemented which has a much simpler algorithm for rewriting and can handle most of KEVM execution (>95% of K execution steps, 6X faster than the Haskell backend). We should implement the same in Haskell, and put it in front of the current backend. The algorithm is here in Python. It needs some adjustments to be sound (which are included in this document). The current version on that branch is much slower because it's using the textual interface for Kore to communicate with the LLVM backend every rewrite step (6-8s roundtrip time), but the actual time spent in the LLVM backend is 40-100ms per query.

The point of this algorithm is not to be clever or good at difficult symbolic execution; it's to be super fast at easy symbolic execution. Because we already have a general symbolic execution engine, we can always fall back to that one. So this algorithm is designed to be as fast as possible in the "happy path" (which is the >95% case), but to be able to detect when it is not sufficient, so it can gracefully fall back to the original path for those harder steps. One of the key observations is that we do not do any simplification that is not needed. In particular, we avoid ever having to do the following (delegating that to the more powerful engine sometimes, but not every step):
However, branching is also expensive. Each branch adds another path of execution we have to explore. For now, we will say that every time this simple rewriter cannot prune a branch point, we will ask the original simplifier to prune it for us. If the original simplifier cannot prune one of the branches, then it is a proper branch point.
Then we can study the cases where we see the backend having to fall back to the original rewriter, and make small generalizations to our rewriter that allow it to make progress in slightly more cases. This algorithm should be able to make it roughly an opcode's worth of execution in KEVM without having to fall back to the original rewriter, maybe more.
- `CTerm`, `Config`, and `Predicate` (a `CTerm` is a `Config` with a list of `Predicate`s, and a `Config` is a `Term` of sort `GeneratedTopCell`).
- `rewrite_step : CTerm -> Optional[Set[CTerm]]`, with initial implementation `_ => None`.
- `{"command": "execute", "cut-patterns": { CTerms }, "depth": -1}`, which matches the existing `execute` endpoint, but has an additional `failure` state that can be returned.
- `preserves-definedness` as a definition load-time pass: https://github.com/runtimeverification/haskell-backend/issues/3373. Implement this simple check here directly: https://github.com/runtimeverification/k/pull/2963 (basically "free constructors + total functions are always defined").
- A fast, simple unification algorithm for `{ T1 #Equals T2 }` (or port the existing one). It should:
  - detect when the equality is `#Bottom`. This is the only `#Bottom` case it should try to detect; and
  - otherwise decompose the equality into subterm equalities `{ SUBTERM1 #Equals SUBTERM2 }`. Do not do any more work on them yet, and do not attempt to simplify them.
- A rule index: (i) the first constructor at the top of the `<k>` cell (if present), or else (ii) `*` (all rules). For example, if `foo` is a free constructor, `SetItem` is a free constructor, `bar` is a function, and `_Set_` is a non-free constructor, then:
  - the index of `<k> foo(X) ... </k>` is `foo`,
  - the index of `<k> bar(X) ... </k>` is `*`,
  - the index of `<k> SetItem(foo(X)) SetItem(foo(X)) ... </k>` is `*`,
  - the index of `<k> SetItem(bar(X)) ... </k>` is `SetItem`, and
  - the index of `<k> X ~> foo(Y) ... </k>` is `*`.
- For terms `T1` and `T2`, if `rule_index(T1) =/= * and rule_index(T2) =/= * and rule_index(T1) =/= rule_index(T2)`, then `{ T1 #Equals T2 }` is `#Bottom`. That is, the rule index forms equivalence classes of terms that could unify with each other.
- `simplify : CTerm -> CTerm` and `simplify_bool : Predicate -> Predicate`. The initial implementation of both is the identity function.
- Compute `I = rule_index(T)`.
- For each rule `LHS_T /\ LHS_P => RHS_T /\ RHS_P` in the rule index for `I` and the rule index for `*`, compute `alpha = { LHS_T #Equals T }` using the fast simple unification algorithm above.
  - If `alpha` is `#Bottom`, reject this rule.
  - If `alpha` is not a matching substitution for `LHS_T`, then fail. A matching substitution for a pattern `Q` is a conjunct of equalities `{ V1 #Equals T1 } #And ... #And { VN #Equals TN }`, where the `Vi` are all element variables from pattern `Q`.
  - For each constraint `C` in the predicates `LHS_P /\ RHS_P` (not any of the existing constraints), compute `C_NEW = alpha(C)` (apply the substitution). Call `simplify_bool(C_NEW)`, and if it's `false`, prune the state.
  - If the rule does not have the `preserves-definedness` attribute, then fail.
- Call `simplify` on the successors. If any go to `#Bottom`, remove them from the list of successors.
- `kup`
- A `prove_cfg` method as logic over the current haskell backend RPC.
- The `prove_cfg` method.

Later steps to get more aggressive performance improvements (note that these were not needed for performance of the pyk engine, except for K cell simplification):
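The single-step procedure described above can be sketched as a small, runnable Python toy. Terms here are nested tuples, rules are drastically simplified, and the helpers stand in for the fast unifier, substitution, and simplification components described in the plan (none of this is the pyk implementation):

```python
# Minimal runnable sketch of the single-step rewrite from the plan above.
# Terms are nested tuples (ctor, args...) or variable-name strings. Returning
# None means "fall back to the original backend for this step".
from dataclasses import dataclass, field

@dataclass
class Rule:
    lhs: tuple           # pattern: constructors are tuples, variables are strs
    rhs: tuple
    conditions: list = field(default_factory=list)  # callables on the subst
    attrs: set = field(default_factory=set)

def fast_unify(pattern, term, subst=None):
    """Syntactic matching only: pattern variables bind subterms.
    Returns a substitution dict, or None standing in for #Bottom."""
    subst = {} if subst is None else subst
    if isinstance(pattern, str):                 # pattern variable
        subst[pattern] = term
        return subst
    if isinstance(term, str) or pattern[0] != term[0] \
            or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        if fast_unify(p, t, subst) is None:
            return None
    return subst

def rewrite_step(term, rules):
    successors = []
    for rule in rules:
        alpha = fast_unify(rule.lhs, term)
        if alpha is None:
            continue                             # #Bottom: reject this rule
        if "preserves-definedness" not in rule.attrs:
            return None                          # can't ensure soundness: fail
        if any(cond(alpha) is False for cond in rule.conditions):
            continue                             # side-condition pruned state
        rhs = tuple(alpha.get(x, x) if isinstance(x, str) else x
                    for x in rule.rhs)           # apply subst (flat RHS only)
        successors.append(rhs)
    return successors

# rule: inc(N) => done(N) requires N < 10   (hypothetical mini-language)
rule = Rule(lhs=("inc", "N"), rhs=("done", "N"),
            conditions=[lambda a: a["N"] < 10],
            attrs={"preserves-definedness"})
print(rewrite_step(("inc", 3), [rule]))   # [('done', 3)]
```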
- A separate `haskell-kompile` binary to do "static" computations ahead of time, once, instead of on each proof (rule indexing and definedness preservation).
- Compute the rule index of the goal, `I_GOAL` (for prover mode). If `I_GOAL == *`, print a very loud warning to the user about how they should pick a better RHS or complain to us about the poor quality of our rule indexer. Only do implication checking if the current term's rule index matches `I_GOAL`, or if `I_GOAL == *`.
- For each term `T_i` that is either the `LHS` or `RHS` of a rule, pairwise analyze the critical pairs `{ T_i #Equals T_j }` for every `i =/= j`. Create equivalence classes of terms by whether they "could unify", that is, whether `{ T_i #Equals T_j } => #Bottom`. For example, if `{ T_1 #Equals T_2 }` is not `#Bottom`, and `{ T_2 #Equals T_3 }` is also not `#Bottom`, then all three of `T1`, `T2`, and `T3` should be in the same equivalence class. Here, we want a powerful unification algorithm that can say more things go to `#Bottom`, to have smaller equivalence classes (run at kompile time, OK to be slower).
- Add a sort `#RuleIndex`, and add a cell `<ruleIndex> .RuleIndex </ruleIndex>`.
- Add a constructor for each equivalence class to sort `#RuleIndex`.
- For each `rule LHS => RHS`, add the cell `<ruleIndex> ruleIndex(LHS) => ruleIndex(RHS) </ruleIndex>`, where `ruleIndex(D)` is the new constructor added to sort `#RuleIndex` for the equivalence class `D` is in.
- The `<k>` cell.
- When starting execution (via `kprove` or via `krun` or the `execute` endpoint), start the `<ruleIndex>` cell as a fresh variable. This will trigger unification with all the potential rules for the first step of execution, but every subsequent step will be able to use the rule indexing.
- This will not work well for the RHS of the `<k>` cell initially, because so many rules are of the form `<k> SOMETHING => . ... </k>`, so the RHS of these rules will look like `<k> _VAR </k>`, which will unify with nearly every other rule LHS or RHS in the definition. We need to combine this with rule follow-set analysis, because many semantics have `<k> INITIAL => STEP1 ~> STEP2 ~> STEP3 ... </k>` and rules like `<k> STEP1 => . ... </k>`, `<k> STEP2 => . ... </k>`, and `<k> STEP3 => . ... </k>`. So we need it so that when `<k> INITIAL ... </k>` first comes onto the `<k>` cell, the rule index indicates that only `STEP1`, `STEP2`, and `STEP3` could follow.
- `simplification` rule.
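The kompile-time "could unify" equivalence-class construction described above is essentially a union-find over rule LHS/RHS terms. A minimal sketch, with a toy `could_unify` standing in for the powerful (slow, kompile-time) unification check:

```python
# Sketch of building kompile-time equivalence classes: union terms i and j
# whenever { T_i #Equals T_j } is NOT #Bottom. `could_unify` is a stand-in
# for the powerful unifier; here it just compares head constructors.
from itertools import combinations

def equivalence_classes(terms, could_unify):
    parent = list(range(len(terms)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(terms)), 2):
        if could_unify(terms[i], terms[j]):  # not #Bottom: same class
            parent[find(i)] = find(j)
    classes = {}
    for i in range(len(terms)):
        classes.setdefault(find(i), []).append(terms[i])
    return list(classes.values())

# Toy check: heads "could unify" when equal or when either is the wildcard.
heads = ["foo", "foo", "bar", "*"]
print(equivalence_classes(heads, lambda a, b: a == b or "*" in (a, b)))
# One class: the wildcard links foo and bar transitively, as in the
# T1/T2/T3 transitivity example above.
```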