runtimeverification / hs-backend-booster

Accelerates K Framework's Haskell backend
BSD 3-Clause "New" or "Revised" License

New symbolic rewriter in haskell backend #3

Closed ehildenb closed 1 year ago

ehildenb commented 1 year ago

We need a new symbolic rewriter to take care of "bulk" execution for semantics: execution that is not interesting and is just shuffling around state and checking simple conditions.

As an experiment, branch _ehildenb of KEVM implements a new rewriter with a much simpler rewriting algorithm that can handle most of KEVM execution (>95% of K execution steps, 6x faster than the Haskell backend). We should implement the same in Haskell and put it in front of the current backend. The algorithm is here in Python. It needs some adjustments to be sound (which are included in this document). The current version on that branch is much slower because it uses the textual Kore interface to communicate with the LLVM backend on every rewrite step (6-8s round-trip time), even though the actual time spent in the LLVM backend is only 40-100ms per query.

The point of this algorithm is not to be clever or good at difficult symbolic execution; it is to be super fast at easy symbolic execution. Because we already have a general symbolic execution engine, we can always fall back to it. So this algorithm is designed to be as fast as possible on the "happy path" (the >95% case), while detecting when it is not sufficient so it can gracefully fall back to the original path for the harder steps. One key observation is that we never do simplification that is not needed. In particular, we avoid ever having to do the following (delegating it to the more powerful engine when necessary, but not every step):

However, branching is also expensive. Each branch adds another path of execution we have to explore. For now, we will say that every time this simple rewriter cannot prune a branch point, we will ask the original simplifier to prune it for us. If the original simplifier cannot prune one of the branches, then it is a proper branch point.

Then we can study the cases where the backend has to fall back to the original rewriter, and make small generalizations to our rewriter that let it make progress in slightly more cases. The algorithm here should be able to make it through roughly an opcode's worth of execution in KEVM without falling back to the original rewriter, maybe more.
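To make the fallback contract concrete, here is a minimal Haskell sketch of the result type such a rewriter might expose; all names here are hypothetical, not the actual hs-backend-booster API:

```haskell
-- Hypothetical interface: each step either makes progress on the happy
-- path or says precisely why it must defer to the general engine.
data StepResult term
  = Rewritten term            -- exactly one rule applied
  | Branched [term]           -- several rules survived the cheap checks;
                              -- ask the original simplifier to prune
  | Aborted AbortReason term  -- fall back to the original path entirely

data AbortReason
  = NoRuleIndexed       -- the rule index returned no candidates
  | NeedsACUnification  -- matching would require AC/collection reasoning
  | UndecidedCondition  -- a side-condition did not reduce to true/false
  deriving Show
```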

Later steps to get more aggressive performance improvements (note that these were not needed for the pyk engine's performance, except for K cell simplification):

ehildenb commented 1 year ago

What is "bulk execution"?

Here is an example execution using the pyk rewriter:

[screenshot]

It goes back and forth between the pyk rewriter and the haskell backend: 62 steps with the pyk rewriter, then 1 step with the haskell backend, then 44 steps with the pyk rewriter, then 1 step with the haskell backend, etc.
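A minimal sketch of the driver loop this alternation implies, assuming a fast-step function that reports failure and a general engine that always makes one full step (both interfaces are assumptions, not existing functions):

```haskell
-- Take cheap steps while possible; on any abort, take a single step
-- with the general engine and resume with the fast rewriter.
run :: Monad m
    => (term -> m (Either abort term))  -- fast rewriter: Right = progress
    -> (term -> m term)                 -- general engine: one full step
    -> Int                              -- remaining depth
    -> term
    -> m term
run fastStep generalStep = go
  where
    go 0 t = pure t
    go n t = do
      r <- fastStep t
      case r of
        Right t' -> go (n - 1) t'                -- happy path
        Left _   -> generalStep t >>= go (n - 1) -- graceful fallback
```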

ehildenb commented 1 year ago

The effect of rule indexing

Here is an example where the rule indexer chosen for KEVM does really well, giving back a single rule that could apply (<3ms to pick a rule and apply it):

[screenshot]

And an example where it does less well, giving back 88 rules that could apply (~70ms to pick a rule and apply it):

[screenshot]

And here is a rule that the index fails on entirely, giving back * (~300ms):

[screenshot]

Rule indexing has a massive impact on performance. The key is an index function that is very fast to compute but also drastically narrows down the set of rules that could apply.

All three rules shown here apply on every opcode execution cycle, but take drastically different amounts of time. Both of the bad indexes (the ones with 88 rules and 360 rules) could be improved by inspecting a slightly different subterm than "the first constructor at the top of the K cell". In the #gasExec(_,_) case, you could inspect the constructor in the second argument position of #gasExec(_,_); in the #deductGas case, you could inspect the constructor at the top of the second item of the KSequence. Ideally we would have a generic way of constructing rule indexes that is also somehow optimal.
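As a rough illustration of that suggestion, here is a hedged Haskell sketch of an index function that normally reads the first constructor at the top of the K cell, but descends one level deeper for the two bad cases. Term, Index, and the special handling of #gasExec/#deductGas are illustrative, not the real implementation:

```haskell
import qualified Data.Map.Strict as Map

type Symbol = String

-- A toy term representation: applications and K sequences.
data Term = App Symbol [Term] | KSeq Term Term | Var String

data Index = TopSymbol Symbol | Anything  -- Anything plays the role of "*"
  deriving (Eq, Ord, Show)

-- First constructor at the top of the K cell, except for symbols we
-- know need a deeper look (the 88-rule and 360-rule cases above).
termIndex :: Term -> Index
termIndex (KSeq (App "#gasExec" [_, arg]) _) = termIndex arg  -- 2nd argument
termIndex (KSeq (App "#deductGas" _) rest)   = termIndex rest -- 2nd K-cell item
termIndex (KSeq t _)                         = termIndex t
termIndex (App sym _)                        = TopSymbol sym
termIndex (Var _)                            = Anything

-- Candidate rules: the term's own bucket, plus rules indexed Anything,
-- which must be tried against every configuration.
candidates :: Map.Map Index [rule] -> Term -> [rule]
candidates table t = case termIndex t of
  Anything -> concat (Map.elems table)  -- the index gives no information
  idx      -> Map.findWithDefault [] idx table
                ++ Map.findWithDefault [] Anything table
```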

ehildenb commented 1 year ago

Constraint Solving

At each step of execution, you may generate new constraints in the form of requires and ensures clauses. These constraints are usually boolean predicates over components of your state. When rules apply, the constraints are instantiated and must be checked to see whether they reduce to false. The more states we can send to false, the fewer branches we need to explore. But constraint solving can get very involved and expensive quickly; for bulk execution we want heuristics that are fast and succeed frequently.

Here is an example where only one rule ends up applying (even though the rule indexing isn't very good), and no constraint solving is needed (of the first 60 steps, 34 are like this, needing no constraint solving):

[screenshot]

Here is an example where only 3 rules are in the index, but each has a side-condition for which we must invoke the LLVM backend (we make 99 calls to the LLVM backend in the first 62 steps, which can probably be reduced somewhat):

[screenshot]

Each invocation of the LLVM backend takes ~110ms and accounts for about 10s of the 17s of execution for the first 62 steps. For each query, ~60ms is spent converting between KAst <-> Kore data structures in Python and loading/unloading the LLVM backend. Here is what a call to the LLVM backend looks like in detail:

[screenshot]

Between the first and second lines, 17ms elapses; this is where the KAst -> Kore conversion happens (among other things). From there we enter the krun Bash script, which makes several quick no-op calls, spends 40ms actually computing in llvm-krun, and makes a few more no-op calls. The overall call to the krun Bash script takes 108ms. Between the "Completed" and "Simplified" lines towards the end, the Kore -> KAst conversion happens, taking ~14ms. Overall the process takes 137ms.

I hope we can get this process down to 60ms, and roughly constant time, with FFI to the LLVM backend. Calling the constraint solver takes much longer than rule matching, for example. There is a hard limit on how fast the backend can ever be with this approach, because the LLVM backend seems to take ~40ms no matter what. In addition, many of these constraints can be solved very quickly with a single rewrite rule, which can likely be applied on the order of ~1ms (not all constraints, though; some trigger massive data-structure manipulations).

Having FFI to the LLVM backend means we can have a simplification routine with fast performance, a simple design, and not much integration work. One strategy we could adopt later is this (for example):

The theory is that user-supplied simplification rules should reduce to true/false very quickly (a few rule applications), and if they do not, it's because we are doing some larger data-structure manipulation that the LLVM backend can handle better.
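A minimal sketch of that layered strategy, assuming a small step budget for cheap simplification rules before deferring to the LLVM backend (the budget of 5 and all names are made up for illustration):

```haskell
-- Try cheap rewrite-rule simplification a few times; if the predicate
-- is still undecided, hand it to the LLVM backend oracle (assumed FFI).
simplifyPredicate
  :: Monad m
  => (pred -> Maybe pred)   -- one simplification-rule application, if any
  -> (pred -> Maybe Bool)   -- is the predicate literally true/false yet?
  -> (pred -> m Bool)       -- LLVM backend oracle (assumed FFI call)
  -> pred -> m Bool
simplifyPredicate ruleStep asBool llvmEval = go (5 :: Int)
  where
    go n p
      | Just b <- asBool p           = pure b      -- decided cheaply
      | n > 0, Just p' <- ruleStep p = go (n - 1) p'
      | otherwise                    = llvmEval p  -- budget spent or stuck
```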

ehildenb commented 1 year ago

Improvements to rule indexing

I improved the rule indexing as described at the end of the rule-indexing comment above, and it shaved 3s off the overall time. The Python rewriter now spends ~14s to do 66 K steps, with ~10s of that in queries to the LLVM backend. For reference, the current haskell backend takes ~24s for the same rewrite steps.

ana-pantilie commented 1 year ago

Glossary:

  1. LLVM FFI => might be easier to do this for a PoC, instead of writing a concrete simplifier right now
     Requirements:
    • use unboxed, basic types instead of any kind of higher-level maps or sets
     Other uses:
    • potential FFI between Python and Haskell, which could replace the RPC server in the future
  2. We need to re-design the basic types (TermLike, Internal.Pattern), keeping in mind that we want fast FFI (see the sketch after this list):
    • not requiring term traversals to figure out whether terms are concrete
    • clean separation between configuration and constraints in patterns
  3. Instead of haskell-kompile, start with a load procedure which does definition initialization and pre-processing (rule indexing, preserves-definedness checks, etc.); this can be moved later on into a separate haskell-kompile binary called by the frontend. This way we can focus our efforts on quickly having a functional new execution engine, without the architectural distractions.
  4. Rule indexing:
    • implement Everett's suggestion of a simple rule-indexing algorithm
    • architect this in such a way that it's easy to extend or replace it with something else later
  5. Unification:
    • implement Everett's suggestion of a simple unification algorithm
    • architect this in such a way that it's easy to extend or replace it with something else later
  6. New rewriting algorithm:
    • "If there are multiple remaining states (branching), call the original haskell backend simplification on the states to see if any go to #Bottom": is this enough?
     Requirements:
    • integrate the old Haskell backend into the new Haskell backend (how?); should we start completely decoupled, using the RPC server to interact with the old Haskell backend?
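For point 2, here is a hedged sketch of what "not requiring term traversals to figure out whether terms are concrete" could look like: attributes cached at construction time behind smart constructors, so the concreteness query is O(1). These types are purely illustrative; the real design is being worked out in the new-datatypes issue linked below.

```haskell
-- Illustrative only: cache attributes at construction so that
-- "is this term concrete?" never requires a traversal.
type Symbol = String

newtype Attributes = Attributes
  { isConcrete :: Bool }  -- room for more cached flags later

data Term
  = App !Attributes Symbol [Term]
  | Var !Attributes String

attrs :: Term -> Attributes
attrs (App a _ _) = a
attrs (Var a _)   = a

-- Smart constructors keep the cached flag correct by construction.
mkApp :: Symbol -> [Term] -> Term
mkApp sym args = App (Attributes (all (isConcrete . attrs) args)) sym args

mkVar :: String -> Term
mkVar = Var (Attributes False)  -- variables are never concrete
```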

Conclusions:

ehildenb commented 1 year ago

Some responses:

Also, I want to make some notes about the "philosophy" here. The entire idea behind having a fast symbolic transition system (symbolic execution) is that you need to figure out which transitions could apply as fast as possible, and to eliminate potential transitions as fast as possible. There are many reasons a transition might not apply; this algorithm emphasizes doing the cheapest and most frequently succeeding checks early. In particular, you could have an algorithm that instead did:

This would be closer to what the current backend does, which is: "finish all algorithms to completion".

But it is often the case that, for example, AC unification is not required because one of the side-conditions in the 3rd step goes trivially to false with a single rewrite rule. So the above algorithm re-arranges the state-pruning checks a bit:

This is a spectrum we can tune; we have knobs we can control. If we put the new rewrite engine in front of the current one and start with only the rule indexing and the simple fast unification, the statistics above show that this alone would be enough for roughly 50% of the rewrite steps we take! So it's easy to build this incrementally on top of the current engine, because we can always fall back to the current engine for (i) state pruning, and (ii) a single step of symbolic execution.
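A small sketch of the short-circuiting this re-ordering buys, assuming a three-valued verdict per check; thanks to laziness, the expensive checks at the tail of the list are never forced once a cheap check prunes the rule (the check names in the comment are hypothetical):

```haskell
-- Three-valued outcome of a single pruning check.
data Verdict = RuledOut | Unclear | Passes
  deriving (Eq, Ord, Show)

-- Combine checks listed cheapest-first; any RuledOut prunes the rule
-- immediately and, by laziness, the later checks are never evaluated.
prune :: [Verdict] -> Verdict
prune = foldr step Passes
  where
    step RuledOut _ = RuledOut  -- decisive: stop here
    step v        r = min v r   -- Unclear taints a Passes result

-- Intended ordering for a rule/configuration pair (hypothetical names):
--   [ indexCheck rule cfg          -- ~free: compare rule indexes
--   , syntacticMatch rule cfg      -- cheap: no AC reasoning
--   , quickSideCondition rule cfg  -- a rewrite rule or two
--   , fullUnifyAndSolve rule cfg ] -- expensive: old backend / solver
```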

This is why I'd argue for tight integration with the current backend. We can lean on it and incrementally make the new rewriter handle more cases (as profiling shows it's even necessary). If we go and write our own from scratch, we will need to handle 100% of the rules of KEVM before we see any performance benefit.

ehildenb commented 1 year ago

Here are the requirements:

I've gone through the implementation steps above again, and made the following adjustments:

Some of the tasks are parallelizable. We need to figure out how to assign and implement them so we achieve this goal by the holiday closure.

ana-pantilie commented 1 year ago

Questions:

  1. New repository
  2. New calls old:
    • spinning up the old RPC server and using it to call into the old backend (cons: it's risky, because we're not currently exercising this codepath when proving)
    • use the old Haskell backend as a library (cons: keep the initialized definition on the side as state, pulling in the dependencies of the old codebase directly) (pros: packaging will be easier)
    • this means reimplementing kore-exec
  3. Old calls new:
    • the new prover would have to have its own state, and the old one would have to keep track of it, so more changes to the old codebase
    • we've already spent a lot of time packaging the old codebase nicely
    • ~it will have to make the new one public if we use the new one as a library~
    • we could use the rpc protocol to call the new one, this has some overhead because it needs setting up
    • will the new codebase have to support branching?
  4. kore-exec --prove is implemented in pyk:
    • new backend implements execute endpoint
    • pyk handles the logic which calls the new backend and falls back to the old one

Observations:

We can work on the following right now (without deciding on how we integrate with the old backend), in order:

  1. design new datatypes (https://github.com/runtimeverification/hs-backend-booster/issues/4)
  2. port the kore parser to our new datatypes
  3. design effect stack
  4. after 2., we can start on the following (in no particular order):
    • load procedure (which does rule indexing, preserves-definedness)
    • the simple unifier (a sketch follows this list)
    • FFI
  5. put it all together and write tests!
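As a sketch of what the simple unifier could look like (a generic fail-fast first-order matcher, not necessarily Everett's actual algorithm): syntactic matching that returns Unclear instead of attempting AC/collection unification, so the caller can fall back to the old backend.

```haskell
import qualified Data.Map.Strict as Map

data Term = App String [Term] | Var String
  deriving (Eq, Show)

type Subst = Map.Map String Term

data MatchResult = Match Subst | NoMatch | Unclear
  deriving Show

-- Match a rule LHS (pattern) against a configuration (subject),
-- threading a substitution left to right.
match :: Term -> Term -> MatchResult
match pat subj = go pat subj Map.empty
  where
    go (Var x) t s = case Map.lookup x s of
      Nothing           -> Match (Map.insert x t s)  -- bind fresh variable
      Just t' | t' == t -> Match s                   -- consistent re-use
      _                 -> NoMatch
    go (App f ps) (App g ts) s
      | f `elem` acSymbols     = Unclear  -- defer AC matching to old backend
      | f /= g                 = NoMatch  -- different constructors: prune
      | length ps /= length ts = NoMatch
      | otherwise              = goList ps ts s
    go (App _ _) (Var _) _ = Unclear      -- subject not concrete enough

    goList [] [] s         = Match s
    goList (p:ps) (t:ts) s = case go p t s of
      Match s' -> goList ps ts s'
      other    -> other
    goList _ _ _           = NoMatch

    -- Hypothetical names for AC (map/set) symbols in the definition.
    acSymbols = ["_Map_", "_Set_"]
```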

Communicating via RPC: easier to isolate issues/behaviors and debug

Long-term:

Either way we'd want an RPC server for the new backend.

execute, simplify and check-implication need to be separate!

The new backend's execute should support cut patterns. We can use the simple unifier here, because the RHS of any realistic claim will have some free constructor as its rule index (marking it as a terminal state). This way, we can avoid having to implement check-implication now.

ehildenb commented 1 year ago

What do we need an effect stack for? Can we try to avoid getting into the polymorphic/monomorphic mess that we got into with the prior backend? The monad stacks seemed to be a source of those problems.

ana-pantilie commented 1 year ago

What do we need an effect stack for?

That's just the way you write meaningful Haskell programs. We can avoid the polymorphic/monomorphic mess by not using MTL.

ana-pantilie commented 1 year ago

Make a JSON-RPC server with a single endpoint {"command": "execute", "cut-patterns": { CTerms }, "depth": -1}, which matches the existing execute endpoint but can return an additional failure state.
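For illustration, the request above could be modelled with aeson roughly as follows; the field names mirror the JSON shown, while everything else (in particular the representation of cut patterns as textual Kore) is an assumption:

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson
import Data.Aeson.Types (camelTo2)
import GHC.Generics (Generic)

-- Assumed shape of the execute request; cut-pattern representation TBD.
data ExecuteRequest = ExecuteRequest
  { command     :: String    -- always "execute"
  , cutPatterns :: [String]  -- textual Kore patterns (an assumption)
  , depth       :: Int       -- -1 = unbounded
  } deriving (Show, Generic)

-- Map Haskell field names to the hyphenated JSON keys ("cut-patterns").
jsonOptions :: Options
jsonOptions = defaultOptions { fieldLabelModifier = camelTo2 '-' }

instance FromJSON ExecuteRequest where
  parseJSON = genericParseJSON jsonOptions

instance ToJSON ExecuteRequest where
  toJSON = genericToJSON jsonOptions
```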

@ehildenb it would be much easier for us to support cut-rules instead of cut-patterns (like the old backend does). Do you think cut-rules would suffice (at least for this initial implementation)?

ehildenb commented 1 year ago

I think we should probably strive to support cut-patterns instead of cut-rules, because I think that will be more useful long-term, but cut-rules also works fine for now.

Note that the cut-patterns we intend to use are all indexable (with the rule index), so the vast majority of the time, all you'll need to do to make sure the cut pattern doesn't match is check that it has a different rule index from the current term. This will be the case for KEVM and IMP at least; I don't know about other languages (though I think it's likely).
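That check is cheap to state; here is a minimal sketch, assuming a termIndex function like the one sketched earlier in this thread:

```haskell
-- The index shortcut for cut patterns: if the indexes differ, the cut
-- pattern cannot match the current term. `termIndex` is assumed.
mayMatchCut :: Eq idx => (term -> idx) -> term -> term -> Bool
mayMatchCut termIndex cutPattern current =
  termIndex cutPattern == termIndex current
-- False rules the cut out for free; True only means "not ruled out":
-- a real match still needs the simple unifier.
```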

I don't think it makes a big difference for V1, but if you guys think it will be easier, let's do cut-rules.

ana-pantilie commented 1 year ago

How does the frontend translate X ~> foo(Y) ...

ehildenb commented 1 year ago

Kast: KSequence([KVariable('X'), KApply('foo', [KVariable('Y')]), KVariable('_Gen0')])

Which in Kore: kseq(EVar('VarX', SortApp('SortKItem')), kseq(App('foo', [], [EVar('VarY', SortApp('SortBar'))]), EVar("Var'Unds'Gen0", SortApp('SortK'))))

ehildenb commented 1 year ago

See: https://github.com/runtimeverification/pyk/blob/master/src/pyk/ktool/kprint.py#L364 (Kast -> Kore) and https://github.com/runtimeverification/pyk/blob/master/src/pyk/ktool/kprint.py#L286 (Kore -> Kast).