Generalized Rule Indexing

Our current Rule indexer is simple and works for KEVM. It won't likely work for KWasm, for example, because KWasm barely uses the <k> cell, instead it focuses on the <instrs> cell.

But the same principle works, which is that there is some part of the configuration you can inspect very quickly to rule out a huge chunk of rules as "never could unify with current term". So we need to compute which parts of the state are those bits to inspect.

This is a sketch of how to go about this process. I'm skeptical if this will ever perform better in practice than just "let the user tell you which cell is the most important to index", but who knows.

First, some definitions:

Convex Anti-unifier: Anti-unification (or least-general generalization, as it's called in some rewriting theory literature), is the process of computing a pattern PAT that is more general than a set of other patters LHS1, ..., LHSN. In Matching Logic, we can compute it directly by computing LHS1 #Or ... #Or LHSN. In the summarizer, we take this even further to compute the convex anti-unifier, which can be done by pushing the #Or down in similar style to pushing #And down in unification routine. Once the #Or cannot be pushed down further (because of two differing symbols on either side), then you replace that subterm with a fresh variable. This is implemented in the pyk library, both a variant that handles pure terms and a variant that handles constrained terms.
Critical pairs: Critical pairs are also well-discussed in rewriting theory. You compute critical pairs by pairwise computing unifiers for the LHS of each rule in the definition. If two rules rule [r1]: LHS1 => RHS1 and rule [r2]: LHS2 => RHS2 have that LHS1 #Equals LHS2 is non-#Bottom, they are said to be a critical pair.
Tri-state matching algorith: A tri-state matching algorithm (introduced here) takes two terms P and Q, and returns either "matches", "does not unify", or "indeterminate".

The point of rule-indexing is to discover very quickly which rules could potentially apply to a given term, by inspecting as little of the term as possible. We will use the concepts of convex anti-unifiers and critical pairs to assist in computing this information. First, we'll start with the more general problem of "Given a set of terms T1, ..., TN and a target term T, quickly filter out which ones could never unify with our target term".

Routine index_term :: Term -> String
- Given a term T, inspect the top-symbol S:
- If S is a free-constructor, return S.
- If T is a kseq with a non-variable first element T[0], return "kseq" ++ index_term T[0].
- Return *.
Initialization time (compute once)
- Compute the convex anti-unifier CAU(X1, ..., XM) := convex_anti_unifier(T1, ..., TN).
- Compute the term matches alpha_I = match(CAU(X1, ..., XM), TI) for each of TI in T1, ..., TN. Note that alpha_I will have the form X1 := P_I1, ..., XM := P_IM.
- Build a matrix where the columns are T1, ..., TN, and the rows are X1, ..., XM (which we'll call "term positions"), and the values in the matrix entries are P_11, .., P_MN (row index first).
- For each entry P_JK in the matrix, also compute the term index PINDEX_JK := index_term P_JK.
- For each row (term position) in the matrix, compute the "critical pair graph": nodes are the terms P_JK (for only the terms where PINDEX_JK =/= "*").
- Sort the rows on this matrix as follows: (i) first by how many elements in the row have index * (less occurrences of * goes higher in matrix), (ii) second by how many distinct connected components the critical pair graph of the row has (more distinct connected components in the graph goes higher in the matrix).
Routine filter_terms :: Set[Term] -> Term -> Set[Term]
- Compute the match alpha_T := match(CAU(X1, ..., XM), T).
- On "does not unify", return the empty set.
- On "indeterminate", return the entire set T1, ..., TN.
- On "matches" with substitution alpha_T with form X1 := Q1, ..., XM := QM, continue.
- Sort the substitution alpha_T to be in the same order as the matrix above.
- For each element of the sorted substitution XI := QI, find the term index QINDEX_I. If QINDEX_I is not "*" then:
- Filter to the set of terms that match QINDEX_I for row I in the matrix, plus the set of terms with index "*" on that row.
- If the remaining set of terms is lower than acceptable threshold, return the current set.
- Otherwise, continue processing the sorted substitution, each time you get an index that is not "*" for a given position, filtering the set of terms again.

Note that this process could be improved in two ways:

The routine index_term could be made smarter, to be able to index terms more frequently.
For a given row in the matrix, once we have computed P_I1, ..., P_IN, we may be able to iterate the process, to compute a more specific rule index matrix for that subterm position.
When computing he connectivity graph for each row in the matrix, we are currently using the term index as a stand-in for unifiability. This is probably fine, but we could also actually just compute critical pairs directly. Supposedly the graph would end up being less connected because we would be able to tell that more terms do not unify, instead of just relying on the rule index to tell us that. It's unclear that the same cannot just be achieved by improving the index_term routine.

runtimeverification / haskell-backend

Generalized Rule Indexing #3769