Closed mlb2251 closed 2 years ago
Proposal: remove the "multiarg/multiuse" move from expansion. Instead, at each step we choose a hole and case-split it on whether it's an App (of 2 new holes), a Lam (of a hole), or a Prim/Var (in which case we case-split on the exact value). This splits the space into disjoint subsets of usage locations. Note that splitting on the exact value of a Prim/Var is the same as the idea of building out the body of an invention (without this we would just be looking at the app/lam structure). The splitting is safe and still enumerates all possible programs, because case-splits are safe in general: you produce the set of all possible cases for a given hole, so you don't lose anything.
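A minimal sketch of this per-hole case-split, with illustrative types (the `Pattern` enum and `expansions` function are my own names, not the actual implementation):

```rust
// Sketch of the per-hole case-split (illustrative types, not the real ones).
// Each hole expands into one of a set of disjoint cases, so the usage
// locations matching a pattern are partitioned rather than duplicated.
#[derive(Debug, Clone, PartialEq)]
enum Pattern {
    Hole,
    App(Box<Pattern>, Box<Pattern>), // case-split into 2 new holes
    Lam(Box<Pattern>),               // case-split into 1 new hole
    Prim(String),                    // split on the exact primitive/variable
}

// Enumerate every expansion of a single hole, given the primitives/vars that
// actually occur at the matching locations. Together the cases cover all
// possible subtrees, so no program is lost.
fn expansions(prims_at_locations: &[&str]) -> Vec<Pattern> {
    let mut out = vec![
        Pattern::App(Box::new(Pattern::Hole), Box::new(Pattern::Hole)),
        Pattern::Lam(Box::new(Pattern::Hole)),
    ];
    for p in prims_at_locations {
        out.push(Pattern::Prim(p.to_string()));
    }
    out
}
```

Since each location's subtree is exactly one of these cases, summing over the branches recovers the full location set.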
Also, at each step we would take a pattern with a frontier of holes and fully do multiarg/multiuse/refinement in the moment to get a bunch of finished inventions.
Why is it better to do multiarg/multiuse/refinement at the end?
Approaches to concretizing at the end:
This builds on the previous two ideas. We will first do a largely arity-blind search for patterns with frontiers of holes, but with a few differences:
A new branch, `arg_choice`, which is one of the branches of expanding any hole. This is the branch where this node is used as an ivar in the final invention (but the index number is not yet assigned).

An assignment string records which ivar each argchoice maps to: `010` means the first and third argchoices were assigned to `#0` and the second argchoice was assigned to `#1`. `101` and `010` are identical inventions, so we canonicalize: the first `0` always comes before the first `1` in the string, and so on. Likewise `1` is not a valid length-one string; only `0` is, etc.
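The canonicalization rule ("the first `0` always comes before the first `1`, and so on") amounts to requiring each digit to be at most one greater than the largest digit seen so far. A sketch, with a hypothetical `is_canonical` helper:

```rust
// A canonical argchoice string introduces ivar indices in first-use order:
// each digit may be at most (max digit seen so far) + 1, starting from '0'.
// So "010" is canonical while the equivalent "101" is not, and "0" is the
// only valid length-one string.
fn is_canonical(assignment: &str) -> bool {
    let mut next_fresh = 0u32; // smallest ivar index not yet used
    for c in assignment.chars() {
        let d = match c.to_digit(10) {
            Some(d) => d,
            None => return false,
        };
        if d > next_fresh {
            return false; // skipped an index, e.g. a string starting with '1'
        }
        if d == next_fresh {
            next_fresh += 1;
        }
    }
    true
}
```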
For `010`, we first assign `0` to the first node, then `1` to the second node, then `0` to the third node, at which point we subset the locations to the ones where `argchoice[0] == argchoice[2]`. This is actually really easy if you maintain a separate vec for each argchoice with length equal to the length of the subset: just zip them with the utilities list (of length equal to the locations list), filter on equality between the argchoice arrays, and sum the result.
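The zip-and-filter step might look like the following sketch; the function name and its argument layout (one vec per argchoice, one entry per location) are my assumptions:

```rust
// For assignment "010" we keep only locations where the first and third
// argchoices took the same value. With one vec per argchoice (each the
// length of the location subset), we zip them against the per-location
// utilities, filter on equality, and sum the surviving utility.
fn multiuse_subset_utility(
    argchoice_a: &[u8], // values of one argchoice at each location
    argchoice_b: &[u8], // values of another argchoice at each location
    utilities: &[i64],  // one utility per location
) -> (Vec<usize>, i64) {
    let mut kept = Vec::new();
    let mut total: i64 = 0;
    for (i, ((a, b), u)) in argchoice_a
        .iter()
        .zip(argchoice_b.iter())
        .zip(utilities.iter())
        .enumerate()
    {
        if a == b {
            kept.push(i); // this location survives the subsetting
            total += *u;
        }
    }
    (kept, total)
}
```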
If `00` as an assignment prefix and the resulting subsetting got you such a low score that you were below the upper bound, you can stop without exploring any extensions of that assignment: that prefix alone subsets the locations enough to fall below the upper bound. In any offspring of this partial invention, `00` will result in an even smaller subset within the parent's `00` subset! And since we know that locally each of the utilities for nodes in that subset will only have gotten smaller in the offspring, we know that this prefix will fail the upper-bound test for the offspring too. This means parents need only pass the prefix reject list along to their children; the child can just skip over each rejected prefix when it constructs assignments, and it adds onto the list as more rejects are discovered. Note that a child might newly find that a prefix like `00`
is not allowed even though it was allowed in the parent, etc., of course.

The initial worklist is just the single-hole partial invention.
loop:

- Priority: `num_expansions * locations` or `prims * locations` or something like that. Basically the goal here is to prioritize large things that are used in many places. (`no_opt_useless_abstract`)
- `no_opt_force_multiuse`: I think we already capture that during assignment?
- `worklist_buf`: however, you should continue to search through the rest of the things to record warning prefixes for the future.
- Oh, also don't allow an expansion into a lambda on the very first hole; we don't allow inventions with a lambda at the top.
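The prefix reject list described above could be carried as a plain set of strings; this is a sketch with hypothetical names, not the actual worklist code:

```rust
use std::collections::HashSet;

// A parent passes its reject list down to its children. A child skips any
// assignment that extends a rejected prefix (the location subset only
// shrinks in offspring, so the upper-bound test is guaranteed to fail
// again), and records new rejects of its own as it discovers them.
fn should_skip(assignment: &str, rejected_prefixes: &HashSet<String>) -> bool {
    rejected_prefixes
        .iter()
        .any(|p| assignment.starts_with(p.as_str()))
}

fn record_reject(prefix: &str, rejected_prefixes: &mut HashSet<String>) {
    rejected_prefixes.insert(prefix.to_string());
}
```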
Implemented in #92 with great success, wahoo!
## Summary
A map `zid -> list of subsets`, which gives the subsets of the full node set; you can therefore filter these for the nodes in your usage set to get the real subsets in this case. See `main` for this, where you cast it as refinement.

## Visual
## Comparison to old approach

## Upper bounding

## Multiuse bounds

## Calculating subsets

## Search order
`size * usages` or something like that, which seems like it could pretty quickly bring you towards the best invention.

## Entropy in Search Order
## Preliminary tests on multiarg upper bound weakening
159 -> 516 partial inventions (still getting a lot of upper-bound and single-use pruning).

## Multiuse Ideas
## Open Qs
How should expanding something work when it might expand into an app or lam or prim?
It also really feels like maybe, in an arity-free way, you could find these largest shared sections and then just start trying them out in decreasing order.