Add Higher-Order Function Instructions

thelmuth commented 9 years ago

It would be cool to have higher-order function instructions like map, reduce, and filter in Clojush. Maybe they'd only be defined on the array-like types (string, vector_integer, vector_boolean, etc.), or maybe they'd somehow be defined on other stacks as well.

The details are vague, and there are some non-trivial things to figure out, but if anyone wants to take this on it would be cool!

Vaguery commented 9 years ago

@thelmuth after telling you to go practice, I now want to ask you to explain your idea for a sandboxed :function with a bit more of an example. Walk me through something by hand.

Also, I don't understand the comment about the "parenthesized block from :exec", above. I don't actually know what that is. Are you talking about :code literals on the exec stack? For instance in an interpreter

exec:  1 2 exec_quote ( 3 int_mod 4 ) int_add ( )

( 3 int_mod 4 ) and ( ) are both code literals. The Push interpreter unwraps those into their components when it encounters them, much like Lisp I always assumed, right?

To determine the effective arity and/or type of a given code block, I assume you'd have to do a sandboxed dry run, especially with the contingent statefulness of certain instructions' outcomes. But (3 int_mod 4) has a (non-normalized) type of [integer]->[integer,integer], right?
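The [integer]->[integer,integer] reading can be checked by hand with a small bookkeeping sketch. This is a toy Python model, not Clojush code: the `SIGNATURES` table and `block_effect` are hypothetical names, and it assumes every instruction always finds its arguments, which is exactly the assumption a sandboxed dry run would be needed to drop.

```python
# Toy model: per-instruction stack effects as (types consumed, types produced).
# The table is illustrative only; it is not read from Clojush.
SIGNATURES = {
    "int_mod": (["integer", "integer"], ["integer"]),
    "int_add": (["integer", "integer"], ["integer"]),
}

def block_effect(block):
    """Return (needed, produced) type lists for a flat block of literals and instructions."""
    needed = []    # items the block must find on the stacks before it starts
    produced = []  # items the block leaves behind when it finishes
    for item in block:
        if isinstance(item, int):
            produced.append("integer")      # an integer literal pushes one integer
            continue
        inputs, outputs = SIGNATURES[item]
        for t in inputs:
            if t in produced:
                # consume the most recently produced item of that type
                idx = len(produced) - 1 - produced[::-1].index(t)
                produced.pop(idx)
            else:
                needed.append(t)            # must come from outside the block
        produced.extend(outputs)
    return needed, produced

print(block_effect([3, "int_mod", 4]))  # (['integer'], ['integer', 'integer'])
```

The block needs one integer from outside (the other argument of int_mod) and leaves two behind (the result and the literal 4).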

thelmuth commented 9 years ago

> Also, I don't understand the comment about the "parenthesized block from :exec", above. I don't actually know what that is. Are you talking about :code literals on the exec stack? For instance in an interpreter

Ah, I was talking about how the new Plush genomes need to know how many code blocks to open. Let me explain.

When we transitioned to linear Plush genomes, a big reason was to ensure that instructions that expect parenthesized code blocks (maybe what you're calling "code literals") will always have them. When evolving Push code directly, despite the ability to evolve semantic code blocks, we often (or even usually) found that :exec stack instructions that could manipulate code blocks were followed by a single instruction instead. This makes it a lot harder to do interesting things with :exec stack manipulations, since they are often operating on single instructions instead of code blocks.

So, now we get to Plush genomes, which are linear lists of instruction maps. Since they're linear, and we need to translate them into hierarchical Push programs, we have to have a way to indicate where parentheses should be. We considered a few options, but went with this: any instruction that can make use of a code block from the :exec stack implicitly opens a parenthesis pair. Then, in each instruction map, we have an epigene that is an integer that tells how many closing parentheses to put after that instruction -- the :close epigene.

Some instructions, for example exec_if, take more than one code block (exec_if takes 2). In this case, if the top :boolean is true, the second block of code is discarded, and if false the first block of code is discarded. Thus, exec_if will always operate on blocks of code. They might be empty or have only one instruction, but hopefully evolution will make use of code blocks with multiple instructions more often this way.

We specify the number of code blocks that an instruction uses by the :parentheses metadata in the instruction, like I linked earlier. So, if you want enumerate_map_exec to take one block of code off the :exec stack (which seems natural to me), then you'll want to have :parentheses 1 in its metadata.
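The translation described above can be sketched as a toy model in Python. This is not the Clojush translator: the `PARENTHESES` table, the gene dictionaries and `translate` are hypothetical, and the sketch assumes that extra closes with nothing open are ignored and that any blocks still open are closed at the end of the program.

```python
# Hypothetical table standing in for the :parentheses metadata.
PARENTHESES = {"exec_if": 2, "enumerate_map_exec": 1}

def translate(genome):
    """Turn a linear genome (list of gene dicts) into a nested Push-like program."""
    tokens = []    # flat stream of items plus "(" and ")" markers
    pending = []   # stack of what each future close emits: "close" or "close-open"

    def close_one():
        tokens.append(")")
        if pending.pop() == "close-open":
            tokens.append("(")   # next block of a multi-block instruction starts

    for gene in genome:
        tokens.append(gene["instruction"])
        n = PARENTHESES.get(gene["instruction"], 0)
        if n > 0:
            tokens.append("(")
            # n-1 "close-open" entries sit on top of one plain "close"
            pending.extend(["close"] + ["close-open"] * (n - 1))
        for _ in range(gene.get("close", 0)):
            if pending:          # assumed: a close with nothing open is ignored
                close_one()
    while pending:               # assumed: unmatched blocks close at the very end
        close_one()

    # Fold the token stream into nested lists.
    stack = [[]]
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            block = stack.pop()
            stack[-1].append(block)
        else:
            stack[-1].append(tok)
    return stack[0]

genome = [
    {"instruction": "exec_if", "close": 0},
    {"instruction": 3, "close": 1},
    {"instruction": "int_add", "close": 0},
    {"instruction": "int_mod", "close": 1},
    {"instruction": 4, "close": 0},
]
print(translate(genome))  # ['exec_if', [3], ['int_add', 'int_mod'], 4]
```

With every close value set to 0, the same genome translates to `['exec_if', [3, 'int_add', 'int_mod', 4], []]` in this sketch: the first block swallows the rest of the program and all closing parentheses land at the end.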

Grok? We need to write this up, but haven't yet besides a small section of my dissertation.

Vaguery commented 9 years ago

Partially grok. I totally believe I get how Plush genomes work (but yes, write it up)... but instructions that take arguments from :exec take items from :exec, right? For example you mention exec_if: it doesn't take two code blocks, it takes two items of any sort from the :exec stack, doesn't it?

So it seems like you're saying the Plush genome "wants" there to be two code literals ("parenthesized code blocks") on the :exec stack when the interpreter executes it, but it's definitely not guaranteed at runtime. Right? (follow links)

My confusion comes from a sense that one of the central design principles of Push is that the genome isn't anything to do with the interpreter's behavior: that instructions (as executed) have to be capable of handling any argument that might possibly exist on the stack from which it's taken. The interesting thing, though not unique by design, about :code and :exec is that items of any type can exist on them.

thelmuth commented 9 years ago

> For example you mention exec_if: it doesn't take two code blocks, it takes two items of any sort from the :exec stack, doesn't it?

Correct. But, barring :exec stack manipulations, we force exec_if to be followed by two code blocks, so it will almost always be followed by them.

But yes, you are correct, the instructions you linked could make it so that an exec_if statement is not followed by a code block but a single instruction (or integer literal, etc.), which is fine. (Note, BTW, that the first two instructions you linked do require code blocks!)

To help you wrap your mind around this better, all of this prescription of parentheses is a translation-time guarantee, not a run-time guarantee. Does that help?

lspector commented 9 years ago

Maybe helps to clarify in a different way (?): When @thelmuth says "we force exec_if to be followed by two code blocks" he means that the Plush->Push translation process will guarantee that in the post-translation Push code exec_if will be followed by two parenthesized sub-programs. This happens before Push program execution. When the Push program is actually executing, if exec and/or code instructions are present, then when an exec_if executes there might be anything (parenthesized or not) or nothing following it on the exec stack.
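The difference between the translation-time guarantee and run-time behaviour can be shown with a toy exec_if in Python. This is a sketch, not the Clojush implementation: the state is a plain dictionary of lists with the top of each stack at index 0, and treating missing arguments as a no-op is the assumed behaviour.

```python
def exec_if(state):
    """Toy exec_if: pops one boolean and two :exec items of any kind."""
    if len(state["exec"]) < 2 or not state["boolean"]:
        return state                      # assumed: act as a no-op when arguments are missing
    first = state["exec"].pop(0)          # whatever happens to be on top: block, literal, instruction
    second = state["exec"].pop(0)
    chosen = first if state["boolean"].pop(0) else second
    state["exec"].insert(0, chosen)       # keep one item, discard the other
    return state

# Straight after translation, two parenthesized blocks follow exec_if.
translated = {"exec": [["a"], ["b", "c"], "d"], "boolean": [True]}
print(exec_if(translated)["exec"])        # [['a'], 'd']

# After :exec manipulation at run time, the two items can be anything.
scrambled = {"exec": [7, "int_add"], "boolean": [False]}
print(exec_if(scrambled)["exec"])         # ['int_add']

# Or there may be nothing left to take.
empty = {"exec": [], "boolean": [True]}
print(exec_if(empty))                     # {'exec': [], 'boolean': [True]}
```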

Vaguery commented 9 years ago

Yes, it does help, though I'm still not 100% sure I've grokked your idea for mappers as sketched in this comment.

It does seem, though, that we ("we" :smile:) could implement a simple convolve instruction right now that

- took the next item off :exec
- stopped unless it was a bare instruction
- found a collection of the same type as its first argument (possibly subject to caveats below)
- found scalars of the same type as its other argument(s), if any
- applied the function to every element of the vector, using the scalars for every application

Caveats: This would probably work best if we only considered instructions that produce scalar outputs, because vector_of_environments, while very sexy in theory, is probably out of reach for the moment.

Concern: One wouldn't want to undertake a huge functional operation without building it out onto the :exec stack in stages, the way we do for :exec_do_* and so forth. Otherwise the counters would be off.
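The convolve steps listed above can be sketched as a toy model in Python. Everything here is hypothetical: the `INSTRUCTIONS` table, the dictionary-of-lists state (top of each stack at index 0), the choice to leave the :exec item in place when a step fails, and the assumption that the instruction's scalar output has the same type as its first argument. It applies the function in a single step, so it does not address the staging concern.

```python
from collections import Counter

# Hypothetical instruction table: name -> (argument types, Python function).
INSTRUCTIONS = {
    "boolean_xor": (("boolean", "boolean"), lambda a, b: a != b),
    "boolean_not": (("boolean",), lambda a: not a),
}

def convolve(state):
    """Toy convolve; returns the state unchanged when any step fails."""
    if not state["exec"]:
        return state
    item = state["exec"][0]
    if not isinstance(item, str) or item not in INSTRUCTIONS:
        return state                                  # stop unless it is a bare instruction
    arg_types, fn = INSTRUCTIONS[item]
    vectors = state["vector_" + arg_types[0]]         # collection matching the first argument
    needed = Counter(arg_types[1:])                   # scalars for the remaining arguments
    if not vectors or any(len(state[t]) < n for t, n in needed.items()):
        return state
    state["exec"].pop(0)
    vector = vectors.pop(0)
    scalars = [state[t].pop(0) for t in arg_types[1:]]
    # apply the function to every element, reusing the same scalars each time
    vectors.insert(0, [fn(x, *scalars) for x in vector])
    return state

state = {"exec": ["boolean_xor"], "vector_boolean": [[False, True, False]], "boolean": [True]}
print(convolve(state)["vector_boolean"])  # [[True, False, True]]
```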

Vaguery commented 9 years ago

Working out an example:

exec:([false true false] true map boolean_xor)
exec:(true map boolean_xor) vecb:([false true false])
exec:(map boolean_xor) vecb:([false true false]) bool:(true)
exec:(boolean_xor) vecb:([false true false]) bool:(true)      <- map fires
...?...                                                       <- how does mapping "unroll"
exec:() vecb:([true false true]) bool:()                      <- [true xor false, true xor true, true xor false]
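One possible answer to the "unroll" question, purely as a sketch: map could rewrite itself into one stage per element on the :exec stack, so that every application is an ordinary interpreter step. The Python toy below assumes a hypothetical `collect` helper instruction, tuples for vector literals, stacks with the top at index 0, and no handling of missing arguments. Nothing here exists in Clojush.

```python
def step(state):
    """Execute the top :exec item of a tiny three-stack interpreter."""
    item = state["exec"].pop(0)
    if isinstance(item, bool):
        state["bool"].insert(0, item)
    elif isinstance(item, tuple):
        state["vecb"].insert(0, item)                 # vector literal
    elif item == "boolean_xor":
        a, b = state["bool"].pop(0), state["bool"].pop(0)
        state["bool"].insert(0, a != b)
    elif item == "collect":
        # hypothetical helper: append the top boolean to the top vector
        state["vecb"][0] = state["vecb"][0] + (state["bool"].pop(0),)
    elif item == "map":
        # hypothetical unrolling: replace map and its function with staged items
        fn = state["exec"].pop(0)
        vector, scalar = state["vecb"].pop(0), state["bool"].pop(0)
        stages = [()]                                 # an empty result vector comes first
        for x in vector:
            stages += [x, scalar, fn, "collect"]
        state["exec"] = stages + state["exec"]
    return state

state = {"exec": [(False, True, False), True, "map", "boolean_xor"], "vecb": [], "bool": []}
steps = 0
while state["exec"]:
    step(state)
    steps += 1
print(state["vecb"], state["bool"])  # [(True, False, True)] []
```

Because each element costs four ordinary steps in this scheme, the work shows up in the step count instead of hiding inside one large instruction.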

lspector commented 9 years ago

> Another crazy thought: what if we had a :function stack? The functions would be blocks of code with input and output types, and would execute in their own scope so as not to disrupt the rest of the program. AHA! This actually sounds like something we already have, believe it or not -- the :environments stack, which is something @lspector and I were working on year(s) ago, with the purpose of being able to execute a block of code without disrupting other stack contents -- a local scope. We were doing it for other reasons (I think it was geometric semantic crossover?), but this could totally be co-opted for this purpose. I don't have time to think about the ramifications right now, but this could be a pretty elegant solution.

This came up in a conversation at GECCO too (I forget with whom).

I definitely think there are exciting opportunities in using the environment stack plus a little bit of something that we haven't fully worked out yet. One idea is to provide a minimal way to get an environment and return instructions so that the package behaves as a function.

Environments get the full Push state, with all stacks, any parts of which they can use or ignore. But changes to that state are thrown away when we return from the environment, except for changes specified by explicit return... and return...pop instructions within the environment. (When leaving an environment we first restore the pre-environment state and then make the changes specified by return... and return...pop instructions.)

We had discussed packaging not only environment and return instructions into single function-creating instructions, but also tying this into the tag system so that the function would be tagged and therefore callable by tag reference.

I don't think we ever implemented any of this (beyond getting environments and return instructions to work on their own) but I still do think it's a worthwhile idea. And it might also open the door to more elegant formulations of ideas involving higher-order functions.

thelmuth commented 9 years ago

I was working on a post saying everything @lspector said, but then got interrupted by having to bathe baby Ben, etc. He gave a good overview of what current environments can do, and it sounds like his other ideas are along extremely similar lines to mine, including the parts about needing to make environments slightly more function-like and attaching tags to them.

For kicks here's the stuff I had already typed, which goes into some detail about how environments work:

> after telling you to go practice, I now want to ask you to explain your idea for a sandboxed :function with a bit more of an example. Walk me through something by hand.

Ok, now this. First, let's talk about the :environment stack and environments, which is what we already have. The idea behind environments is to execute some code in a local scope, so that the code doesn't affect the other stacks.

There are two ways to start an environment. environment_new takes the top block of code on the :exec stack and executes it in a new environment. The environment_begin instruction does the same thing, but uses the entire rest of the :exec stack. Environments can be ended in two ways: running out of code on the :exec stack, and reaching an environment_end instruction.

When an environment starts, it pushes the current Push state onto the :environment stack. When it finishes, it pops the top of the :environment stack, restoring the state as it was before the environment started. The only way for code inside an environment to affect the rest of the program is to use return_X and return_Y_pop instructions, where X and Y stand for the name of a stack. When a return_X instruction is executed, that type is put onto the :return stack, and when the environment finishes, the top item of the environment's X stack is put onto the X stack in the base environment. The return_Y_pop instructions are similar, except they specify a stack to pop after returning from the environment, allowing a program to "consume" arguments.
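The save, run, restore, then apply-returns cycle can be sketched as a toy model in Python. This is not the Clojush implementation: `run` and `environment_new` are simplified stand-ins, stacks are lists with the top at index 0, and applying the pops before the pushes is an assumption made here so that a block can consume its arguments and then deliver a result.

```python
import copy

def run(state):
    """Tiny interpreter for the demo: integer literals, int_add, two return instructions."""
    while state["exec"]:
        item = state["exec"].pop(0)
        if isinstance(item, int):
            state["integer"].insert(0, item)
        elif item == "int_add" and len(state["integer"]) >= 2:
            a, b = state["integer"].pop(0), state["integer"].pop(0)
            state["integer"].insert(0, a + b)
        elif item == "return_integer":
            state["return"].append(("push", "integer"))   # record the type, not the value
        elif item == "return_integer_pop":
            state["return"].append(("pop", "integer"))

def environment_new(state):
    """Run the top :exec block in a local scope and return the restored outer state."""
    block = state["exec"].pop(0)
    saved = copy.deepcopy(state)          # snapshot that would sit on the :environment stack
    inner = copy.deepcopy(state)
    inner["exec"] = list(block) if isinstance(block, list) else [block]
    inner["return"] = []
    run(inner)                            # the environment ends when :exec runs out
    # Assumed order: restore first, then apply pops, then pushes.
    for action, stack in inner["return"]:
        if action == "pop" and saved[stack]:
            saved[stack].pop(0)
    for action, stack in inner["return"]:
        if action == "push" and inner[stack]:
            saved[stack].insert(0, inner[stack][0])
    return saved

adder = ["int_add", "return_integer", "return_integer_pop", "return_integer_pop"]
print(environment_new({"exec": [adder, "after"], "integer": [2, 3], "return": []}))
# {'exec': ['after'], 'integer': [5], 'return': []}

# Without return instructions, nothing the block did survives.
print(environment_new({"exec": [["int_add", 10], "after"], "integer": [2, 3], "return": []}))
# {'exec': ['after'], 'integer': [2, 3], 'return': []}
```

The first block behaves like a two-argument function: it consumes both integers and leaves only their sum.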

Vaguery commented 9 years ago

Seems simple enough. Leaves room for some finesses, like pushing the entire state or just components.

We really need to work on the docs, since there's almost no way for anybody to suss all this unfulfilled potential out of the codebase as it exists right now :(

lspector commented 9 years ago

I just copied Tom's description above to the Orientation category of push-language.hampshire.edu.

Vaguery commented 9 years ago

I stepped away for a couple of days, and came back this morning to finish this first design spike. Basically what I wanted from the exercise is to add a new type and some associated instructions, add a new problem definition that invokes those, and be able to run that... and then throw it away and do it again, with a goal of improving the experience.

90% there.

So now it's dying with a stack trace that strongly implies to me that the creation of random genomes requires some sort of extra information. Can we talk about that? Am I supposed to provide additional information for this random code generator, for instance? I haven't written any epigenetic markers, personally, on the assumption that defaults were in place.

[the instructions look like they're getting loaded up here]
...
Generating initial population...
Processing generation: 0
OutOfMemoryError Java heap space
    java.util.Arrays.copyOf (Arrays.java:2882)
    java.lang.AbstractStringBuilder.expandCapacity (AbstractStringBuilder.java:100)
    java.lang.AbstractStringBuilder.append (AbstractStringBuilder.java:390)
    java.lang.StringBuffer.append (StringBuffer.java:224)
    java.io.StringWriter.write (StringWriter.java:84)
    clojure.core/print-simple (core_print.clj:128)
    clojure.core/fn--5398 (core_print.clj:131)
    clojure.lang.MultiFn.invoke (MultiFn.java:231)
    clojure.core/pr-on (core.clj:3322)
    clojure.core/print-sequential (core_print.clj:58)
    clojure.core/fn--5453 (core_print.clj:279)
    clojure.lang.MultiFn.invoke (MultiFn.java:231)
    clojure.core/pr-on (core.clj:3322)
    clojure.core/print-sequential (core_print.clj:58)
    clojure.core/fn--5406 (core_print.clj:143)
    clojure.lang.MultiFn.invoke (MultiFn.java:231)
    clojure.core/pr-on (core.clj:3322)
    clojure.lang.Var.invoke (Var.java:419)
    clojure.lang.RT.print (RT.java:1748)
    clojure.lang.RT.printString (RT.java:1727)
    clojure.lang.ASeq.toString (ASeq.java:21)
    clojure.core/str (core.clj:513)
    clojush.util/open-close-sequence-to-list (util.clj:263)
    clojush.translate/translate-plush-genome-to-push-program/fn--977 (translate.clj:87)
    clojush.translate/translate-plush-genome-to-push-program (translate.clj:67)
    clojush.translate/population-translate-plush-to-push/fn--984/fn--985 (translate.clj:128)
    clojure.core/binding-conveyor-fn/fn--4107 (core.clj:1839)
    clojure.lang.Agent$Action.doRun (Agent.java:114)
    clojure.lang.Agent$Action.run (Agent.java:163)
    java.util.concurrent.ThreadPoolExecutor$Worker.runTask (ThreadPoolExecutor.java:895)

Vaguery commented 9 years ago

By the way, trying to suss out where that was happening, I set :max-generations 0 and it still dies with the same stack trace.

I've pushed the broken state to the current branch, so I can spend some time later trying to identify the problem.

thelmuth commented 9 years ago

The function you linked to is a single instruction map generator. If you are using it thinking it will create random genomes, you're going to have problems. What you want is random-plush-genome.

If you know all of this, let me know and I'll dig further. It would also help to have a pointer to the code where you're making the random code.

Vaguery commented 9 years ago

I'm not actively invoking anything, except via lein run clojush.problems.tozier.float-regression-with-enumerators. All I'm trying to do is run a pushgp problem that uses the new instructions and types.

See the file problems/tozier/float_regression_with_enumerators.clj in this branch

thelmuth commented 9 years ago

First of all, I'm not getting the same exception you did -- did you change the code since then?

The exception I'm getting is that you included the set of all instructions @registered-instructions in your list of instructions, instead of concatenating it. I'll make a pull request to fix this shortly. Once I fixed that, it seemed to run fine.
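The include-versus-concatenate mistake is easy to see in miniature. The snippet below is a Python analogy, not the Clojure from the problem file: `registered_instructions` is a stand-in for @registered-instructions, and "in1" and 0.5 are placeholder entries.

```python
# Stand-in for @registered-instructions.
registered_instructions = ["int_add", "int_mod", "boolean_xor"]

# Included: the whole collection becomes ONE element of the instruction list.
nested = ["in1", 0.5, registered_instructions]

# Concatenated: each registered instruction is its own element.
flat = ["in1", 0.5] + registered_instructions

print(len(nested))  # 3
print(len(flat))    # 5
```

In the nested version, anything that picks a random element can pick the entire instruction collection as if it were a single instruction.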

One last note: you can easily change arguments on the command line, which makes testing things easier, instead of having to stick things like :max-generations 0 in the problem file. Here's how:

lein run clojush.problems.tozier.float-regression-with-enumerators :max-generations 10 :population-size 20

thelmuth commented 9 years ago

Oh, I also forgot to mention that you're overwriting :epigenetic-markers, which is [:close] by default. The way you're doing it will result in programs with all closing parentheses coming at the end of the program. I'd recommend simply removing that line from your argmap.

Vaguery commented 9 years ago

LOL. I copied the argmap from another example, and AFAIK only changed it by putting the require all instructions thing.

Vaguery commented 9 years ago

The list concatenation did it.

Do you want to explain why this isn't an open issue?

thelmuth commented 9 years ago

> Do you want to explain why this isn't an open issue?

The problems that explicitly overwrite the :epigenetic-markers with an empty vector do so because their instruction sets include only instructions that do not make use of semantic parentheses. So, including :close epigenetic markers wouldn't cause any problems, but would include a lot of unnecessary overhead, so we turn them off. All other problems use the default :epigenetic-markers from clojush/pushgp/pushgp.clj, which has them turned on.

Vaguery commented 9 years ago

Excellent! So now I've worked all the way from defining a new type, adding instructions, getting them registered, and finally using them in a running example. That completes the first pass!

Have you run the tests I wrote? I'd like to make sure (since you have a copy now) that they run—as long as you've made the adjustment spelled out at the top of the midje tutorial.

thelmuth commented 9 years ago

I just tried lein midje for the first time after adding that thing to my lein profile (which was non-trivial, since it turns out you have to merge it with existing plugins, of which I had one).

When I run lein midje in your enumerators branch, I get this:

Exception in thread "main" java.lang.Exception: Duplicate Push instruction defined:a0
at clojush.pushstate$register_instruction.invoke(pushstate.clj:42)
at clojush.problems.boolean.mux_6$eval13554.invoke(mux_6.clj:40)
at clojure.lang.Compiler.eval(Compiler.java:6619)
at clojure.lang.Compiler.eval(Compiler.java:6608)
at clojure.lang.Compiler.load(Compiler.java:7064)
at clojure.lang.RT.loadResourceScript(RT.java:370)
at clojure.lang.RT.loadResourceScript(RT.java:361)
at clojure.lang.RT.load(RT.java:440)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5336)
at clojure.core$load_lib$fn__4967.invoke(core.clj:5375)
at clojure.core$load_lib.doInvoke(core.clj:5374)
at clojure.lang.RestFn.applyTo(RestFn.java:142)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$load_libs.doInvoke(core.clj:5413)
at clojure.lang.RestFn.applyTo(RestFn.java:137)
at clojure.core$apply.invoke(core.clj:619)
at clojure.core$require.doInvoke(core.clj:5496)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at midje.repl$load_facts$fn__7980.invoke(repl.clj:206)
at midje.repl$load_facts.doInvoke(repl.clj:192)
at clojure.lang.RestFn.invoke(RestFn.java:397)
at user$eval8043.invoke(form-init5683710115900330494.clj:1)
at clojure.lang.Compiler.eval(Compiler.java:6619)
at clojure.lang.Compiler.eval(Compiler.java:6609)
at clojure.lang.Compiler.load(Compiler.java:7064)
at clojure.lang.Compiler.loadFile(Compiler.java:7020)
at clojure.main$load_script.invoke(main.clj:294)
at clojure.main$init_opt.invoke(main.clj:299)
at clojure.main$initialize.invoke(main.clj:327)
at clojure.main$null_opt.invoke(main.clj:362)
at clojure.main$main.doInvoke(main.clj:440)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:419)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.Var.applyTo(Var.java:532)
at clojure.main.main(main.java:37)
Subprocess failed

Do you not get this error? Is it a midje thing, or is it because it's loading in a bunch of Digital Multiplier problem files, which is what it sounds like?

Vaguery commented 9 years ago

Somewhere (probably the top of every test file) I said to only run lein midje :autotest test until we figured out what the problem with clojush is. (It is a clojush error, in the sense that by default lein midje will try to load both your src and test directories, because apparently many people who write tested Clojure want to have tests intermingled with their source code.)

Also, lein midje and clojush may be the most irritating things in the world to type on a computer with autocorrect turned on.

thelmuth commented 9 years ago

Yes, you did mention that somewhere (and in the code), and I had the feeling I was doing something wrong but I had forgotten where to look for it.

After doing that, it looks like it's working great! It looks like midje runs continuously and checks things when files are saved? Nifty! It says it's passing all tests.

Vaguery commented 9 years ago

Good, now throw that branch away :)

lspector / Clojush

Add Higher-Order Function Instructions #147