Grammar-guided program evolution

This is an idea from @robert-haas on a discord channel.

The original moses created combo trees out of the arithmetic ops (plus, minus, times, divide) the boolean ops (and/or/not) and a few others (greater-than, etc...) The allowed ways in which these can be mutated was hard-coded in an adhoc manner. For example, you can only decorate bools with boolean knobs, and contins with contin knobs, etc.

The current as-moses does the same, except not its atomese and not combo. The knob decoration is still adhoc, hard-coded. If as-moses doesn't know about some op, it can't deal with it (for example -- hyperbolic tangent -- you could hack as-moses to add that, but it would be a hack.The next function to come along would be yet another hack.)

The suggestion is to replace the knob-decoration code with a formal specification of a grammar. There would be a grammar, that defines exactly what a "valid program tree" is. New program trees can be created and mutatated, only if they obey the rules of the grammar.

This would allow as-moses to create and mutate trees in any problem domain, and not just arithmetic+bool.

This is interesting, because some of the problem domains are audio and video, and some of the new kinds of ops include lo-pass filters, high-pass filters, chirp filters, squelch filters, etc. It's hard/awkward to try to boil these down to just arithmetic+bools.

Hey, here's most of what I've been doing with grammar-guided genetic programming (G3P) so far: https://github.com/robert-haas/alogos

The current implementation can be used to evolve Atomese, but also any other language you can capture with a context-free grammar. Other people have also used formal grammars that are more expressive, mostly mildly context-sensitive ones.
I've used Python, partly due to familiarity and ease of prototyping, but also because its relative lack of speed isn't relevant as soon as you have an objective function that takes roughly 0.1 to 1 ms to evaluate. In that case the majority of computation time is spent on evaluation of candidate solutions (no matter if serial, parallel or distributed) rather than the search algorithm itself. So a reimplementation in a performant language wouldn't help much. Adding some intelligence to the search, however, could lead to a major speed up in hard problems.

Agree that 99% of cpu time is in evaluation, and not in the mutation of the trees. However ...

Writing code to export atomese to your system, and then re-importing the results is going to be far more complex and fragile and buggy, than simply doing the mutation in-house, directly. Rainbow of reasons:

The grammar can be specified in Atomese
The tree instances can be specified in Atomese
The data on which the trees are to be evaluated is expressed in Atomese (audio, video...)
All of these can be saved to disk (for later exploration) and distributed over the network (for parallelism) Both disk i/o and network distribution work today, are fully debugged and feature-full.
Discord channels keep hinting that there will be a visualizer "any day now".

Off-topic, but I'm also heavily biased: I spent several years with moses. I learned several key things:

The number-1 most important parameter is complexity temperature. Sometimes, taking a less fit tree and mutating that will give better results than taking a more fit tree and trying to fine-tune it. It provides a way to escape from a local maximum.
Point mutation is OK for fine-tuning, but cross-over is wayyy better for avoiding local maxima. The original MOSES did not do cross-over.
Feature selection and dimensional reduction is very important for high-dimensional datasets.

That's a very convincing list of reasons. I'm looking forward to see the growth of as-moses into this direction!

Some general insights I gained with alogos so far:

A grammar does not only define a language, but also the structure imposed on strings of the language, i.e. the shape of derivation trees (or equivalently parse trees). Therefore the grammar defines 1) the elements in the search space and 2) the neighborhood structure between elements in the search space (at least with straightforward search operators such as node mutation and subtree crossover). Since there are many grammars that can define the same language, and it is possible to transform a grammar by applying language-preserving changes to it, one could try to automatically optimize the neighborhood structure of a given search space. One idea with CFGs is to transform them into a normal form (e.g. Chomsky, Greibach, ...) and thereby enable a somewhat uniform distribution of mutation points within all branches of possible derivation paths - but that could be a misleading idea as well and needs empirical testing to determine its actual merit.
Single-solution based metaheuristics (e.g. a hill climber or simulated annealing) seem to perform not as well as population based metaheuristics (e.g. evolutionary algorithms, particle swarm optimization) in this setting. Somewhat related, the genetic programming community suggests that harder problems should be tackled with larger population sizes, based on empirical observations afaik, possibly explained by needing a broader sample of random programs for making crossover effective. Also somewhat related, diversity preservation seems to be a promising approach to improve the search, i.e. actively preventing the population from getting gradually taken over by the currently best-performing solution and thus losing a lot of information gained along the way. In evolutionary algorithms, tournament selection appears to work better than other ways of parent and survivor selection that have higher selection pressure, i.e. those that tend to always include the best and exclude the worst few candidates. As a word of caution, all these observations come from testing on a limited set of configurations on a limited set of toy-ish problems and should definitely be verified more systematically and on a reasonable set of benchmark problems, which I'm going to collect once I've implemented the methods I find appealing to compare.
Any metaheuristic search will often revisit known elements of the search space, therefore it is essential to keep a memory of earlier evaluations instead of recomputing the objective function on the same candidate solution many times - especially if the objective function is costly. This could be realized with a simple cache throughout one run or a dedicated database that can be reused between runs.
Any metaheuristic search can perform really badly in a single run. In almost any real-world application you'll want to perform multiple runs to drastically improve the probability of finding a sufficiently good result. A restart is often better than spending more time in a search that got itself stuck in some corner of the search space (not one local optimum, but some wider basin of bad solutions). On the other hand, diversity preservation might be an effective measure against that.
There are many ideas in literature how good partial solutions (i.e. parts of derivation trees, which can be sections of a program, but also a certain hierarchical division) could be identified and deliberately reused instead of just being one out of many possibilities during cross-over. This might be be the key for solving very hard problems. I think the best chance of success lies in not trying to do that explicitly, because partial solution quality usually heavily depends on a wide context where it appears, e.g. a number in one place has no meaning in a very different place. I think what could realistically work better is using some deep learning approach for it that learns good partial solutions within a non-explicitly defined context, and for example letting it act as an additional intelligent variation operator that is applied on some individuals (~exploitation), while other individuals still undergo random variation to generate novelty (~exploration).

Summed up: Without better empirical data, I'd by default go with an evolutionary algorithm with a "large" population size (not <50, better 100-500, for tricky problems even in the 1000s), a simple diversity preservation mechanism (e.g. not allowing more than n copies of a candidate solution in survivor selection), and as search operators mutation & crossover & tournament selection.

Let me reply piece-wise. Grammar. There is not one, but two: In "classical" moses, the first grammar is the collection of the legal ways in which the arithmetic ops (+ - * /) can be combined with constants (0,1, 2), variables (x,y,...) relations (<, =, >) and bools (and,or,not, T, F) and misc funcs (impulse, cosine, exp, log...) So that is the grammar of the program trees.

Then there is the second grammar: this is the (implicit) grammar of the dataset that is be analyzed/understood. That grammar is unknown and needs to be discovered.

I'm interested in the first grammar only so much as it helps me create valid program trees. Otherwise, its not all that interesting. Most of my effort is focused on discovering/discerning the second grammar.

The uniform distribution of mutation points is .. interesting. The original moses had something called "EDA" -- estimated distribution of algorithm -- but it was ripped out cause it was painfully slow. Maybe it should be restored? It attempted to assign some kind of Bayesian(?) probability to which parts of the program tree were "good" in some way. I don't recall. It's in Moshe Looks thesis. I haven't read the rest of your post yet, seems you get into this...

Single-solution based metaheuristics (e.g. a hill climber or simulated annealing) seem to perform not as well as population based metaheuristics

FYI, In moses, we generate a handful of starter individuals. Then pick one. Hill climb for a while, until hill climbing slows down/stalls. Then pick another, and repeat. When overall progress stalls, do some cross-over, then hill climb some more. Keep a population of a few hundred of the best ones. Pick the next one(s) to hill-climb or cross-over with a weighted random pick. The weighting has two parameter penalties: one for overall score, and another for overall complexity. So lower scoring individuals are less likely to be picked. And more complex individuals are less likely to be picked.

In almost any real-world application you'll want to perform multiple runs to drastically improve the probability of finding a sufficiently good result. A restart is often better than spending more time in a search that got itself stuck in some corner of the search space (not one local optimum, but some wider basin of bad solutions).

I never-ever saw this. Or rather, I did, early on, and promptly realized the core problem: the selective pressure was way too high. The solution was to enlarge the population, lower the pressure. So, basically, to "automatically run multiple runs". Duhh. Why do it by hand, when you can do it automatically?

There are many ideas in literature

The point here is NOT to use this tech to find a super-whiz-bang program that fits large datasets with high quality. That cannot be done with current algos. Not possible. And anyway, the DL/NN codes do this a zillion times better! Give up on the simplistic dream of evolving a complex program that "fits the data".

All I want out of this is a so-so, mediocre, better-than-random-chance description. I have completely different, unrelated algos to take other bigger, deeper steps. So I want to use MOSES/program-evolution only as a kind of information amplifier, to improve classification only a litle bit -- factors of 2x, 4x, 8x, if more, great, but I'm not asking for much. Just mediocre is enough for what I need. Later stages are used to revise to get better results.

Here's an example, it might be the first place I'll use this stuff. I want to use it for for text segmentation (c.f @akolonin). Anton has characterized 3-4 different text segmentation algos: "transition freedom", "conditional probability", "mutual information", and some others. These are just .. algos. They are really pretty simple algos -- not complicated. As program trees, they are shallow. So I want to automate the generation of text segmentation algos. As a starting point, it would include the ones Anton considered. Then I want to mutate, evolve, adjust these automatically, including automatically tuning parameters. The result does not have to be "great segmentation", just "good-enough".

Then, outside of the above steps, there is a distinct step of statistical data gathering, and construction of a lexis, a discovery of a simple grammar. This step, this second step, will be a much better tokenizer than the one that the genetic programming found. However, it needs to be pumped with some initial data that is not totally random, but is at least sort-of OK. So I want to use the genetic programming to find something sort-of OK, so I can feed it into this second step.

opencog / asmoses

Grammar-guided program evolution #103