lspector / Clojush

The Push programming language and the PushGP genetic programming system implemented in Clojure.
http://hampshire.edu/lspector/push.html
Eclipse Public License 1.0
331 stars 94 forks source link

Add Higher-Order Function Instructions #147

Open thelmuth opened 9 years ago

thelmuth commented 9 years ago

It would be cool to have higher-order function instructions like map, reduce, and filter in Clojush. Maybe they'd only be defined on the array-like types (string, vector_integer, vector_boolean, etc.), or maybe they'd somehow be defined on other stacks as well.

The details are vague, and there are some non-trivial things to figure out, but if anyone wants to take this on it would be cool!

lspector commented 9 years ago

We do already have code_map, which could be used as a model for others. Also, several exec instructions have related functionality.

thelmuth commented 9 years ago

There are also exec_do*vector_integer and similar instructions for the other vector stacks.

Vaguery commented 9 years ago

I would be happy to add these. I've done them at least twice in push-forth and duck, and it's a good place for me to get a better sense of adding comprehensive midje tests as I work. :)

The bigger question, of course, is how do you want to handle big swathes of functionality like this? A few instructions at a time, or a suite all together?

thelmuth commented 9 years ago

Personally, I'd say do it however you like! If you want feedback, a little at a time might be good, but most things "never get done", so if you're going to "get it done", then do it your way :)

Vaguery commented 9 years ago

So let it be written.

lspector commented 9 years ago

Please be sure, though, that the execution of a single instruction is always well bounded in space and time, since we only check for limits being exceeded between instructions. I'm pretty sure we've been careful to do this -- for the definition of "well bounded" that has mattered for us in practice -- for every instruction currently in Clojush, but I also think (though I'm not sure) that @Vaguery has had a somewhat different philosophy on this (maybe because of different ways to catch things when they run out of control).

Vaguery commented 9 years ago

My idea right now is to bring over something like the :enumerator type I built in one of the other languages. It is a direct extension of the forms you are used to with exec_do_etc and code_doetc`, and so it's completely reentrant and iterative in the same way.

Briefly: am enumerator contains one immutable sequence (I will start with vectors here) and an associated persistent "pointer state". As with code there is no need that the sequence be single-typed, and one enumerator may contain a vector_integer and another a vector_string or a string item.

When created, the enumerator's pointer is set to 0.

A standard library of pull the top enumerator from that stack and push the following to exec:

Note: in most of the implementations I've worked on where the goal was extending Push, I did not want anything reaching into the associated sequence and changing it. In duck things are more fluid, and as I recall the stack-affecting messages like :dup and :pop can be received by an Enumerator instance as well, so all kinds of madness happens.

Anyway, @lspector, there is no more than one additional copy of the sequence ever, and so I think the computational implications of this approach are in keeping with the Push philosophy at least as much as code_do_times and so forth are.

thelmuth commented 9 years ago

I didn't fully digest all of the details, but this looks cool to me. I like the idea of the enumerator stack. It always felt a little scary to me that indices for the exec iteration instructions are kept on the exec and integer stack, where they can be messed around with easily. An enumerator stack that keeps track of that stuff without multipurposing the exec and integer stacks seems like a clean way to avoid that mess. Thumbs up!

Vaguery commented 9 years ago

Is there a pre-existing idiom (or convenience function) for adding a new type to clojush.globals/push-types? I would prefer to have the responsibility for registering a new type live in the definition of that type, so people who don't want to use it don't end up with extra chaff added to globals.

thelmuth commented 9 years ago

Nope. The only current way that I'm aware of is to just add it to that list of stack types.

Vaguery commented 9 years ago

I could use some feedback on the work I've done so far. It's not in any sense "done", mainly because there's a huge amount of repetition in the instruction definitions, but I'd like to know if it makes sense so far to an external reader familiar with Clojush's overarching structure.

You can compare to the more-or-less intact Clojush state using the branch comparison view

thelmuth commented 9 years ago

Definitely looking like a good start! A few comments:

Otherwise, it's hard to tell if things look "right" without comments about what they're supposed to do.

Vaguery commented 9 years ago

Do you know if there is room inside define-registered for a docstring?

Good catch on the return state. I was just getting around to getting rid of all the repetition that masked that from me.

Do the midje tests, which explicitly say what every thing should do in every edge case, help clarify what the code is intended to do? I will also add comments because that's the standard and because the docs are generally pretty bad, but the intention is always best spelled out in the test suite.

Vaguery commented 9 years ago

Do you know if there are any instructions that don't start by setting the metadata of their requirements, and (fn [state]...?

(and I mean when they're "unrolled", too; I see for example the inner function of popper in the example is equivalent)

thelmuth commented 9 years ago

Do you know if there is room inside define-registered for a docstring?

I don't know! You could always try and see if anything breaks.

I will also add comments because that's the standard and because the docs are generally pretty bad, but the intention is always best spelled out in the test suite.

I didn't check, but will. Though, that assume there aren't any typos/bugs in the test suite. And also: English is easier to understand than someone else's code.

Do you know if there are any instructions that don't start by setting the metadata of their requirements, and (fn [state]...?

Not that I know of. But, there's no reason you couldn't use another idiom. Each instruction has to be a function that takes a push state and returns a push state, and needs metadata associated with it that tells what stacks it uses, but you can get there any way you like.

lspector commented 9 years ago

The define-registered macro, as currently written, cannot take a doc string. This has annoyed me and I've resorted to using comments in the past, but it could certainly be fixed. define-registered would have to be written to take an optional doc string and to attach it as meta-data to the instruction name in the appropriate way (which I haven't investigated, but is probably not hard, modulo the usual mental orientation/clarity that's needed when writing macros).

Vaguery commented 9 years ago

Was just about to open this as an issue :)

Vaguery commented 9 years ago

I've written a little testing helper function called safe-execute in my midje tests which will

  1. apply execute-instruction to a state
  2. raise an exception if the returned value doesn't have all the push-types as its keys (e.g. when it's nil)
  3. return the result otherwise

I am making an effort to use this in every test call of an instruction, so that every returned value from every edge case is validated. Obviously this is too much work for the interpreter to do in the normal course of running a program (?), but it's really pretty spooky that this sort of thing (blithely ignoring nil arguments) is still lurking in there.

thelmuth commented 9 years ago

Was that last comment supposed to go in #153?

Vaguery commented 9 years ago

Will do enumerator_map, enumerator_set and the remaining conversion instructions tomorrow.

A few questions of preference:

What have I missed?

Vaguery commented 9 years ago

@thelmuth No, it as intended for here, as a response to the bugs you caught!

thelmuth commented 9 years ago

Do you want to permit "looping" behavior? I'm guessing this would be a boolean flag, settable by a toggle instruction, which when true would permit the Enumerator to "wrap" in both directions when the pointer moves beyond the currently permitted bounds.

This sounds a bit crazy. Good crazy? I don't know. It would be a lot easier to have infinite looping behavior with the looping you're talking about.

re: enumerator_map: I would love to be able to use this instruction (and similar ones) without having to use the :code stack at all. The :code stack is great when you want to use it a lot, but adds a ton of extra complexity (and a larger instruction set) if you otherwise don't need it. Would it be possible for some of these instructions to use the :exec stack instead, or to have 2 versions, one which takes the code from :exec and one from :code, like exec_do*range and code_do*range?

Vaguery commented 9 years ago

No, I'm sorry, I can't possibly see how anybody could do that. Good grief, man, these aren't computers we're... oh wait.

Yes!

Vaguery commented 9 years ago

While we're discussing things: as I've mentioned, this isn't my first go on the Enumerator ride. One set of instructions (from push-forth I think) I had sort of set aside because I figured you'd think they were pretty weird. These take an entire stack and create an Enumerator out of that. enumerator_from_all_integers and so forth, I guess they would be here.

Just sayin'.

My favorite is one I suppose would be called enumerator_from_all_enumerators....

thelmuth commented 9 years ago

While we're discussing things: as I've mentioned, this isn't my first go on the Enumerator ride. One set of instructions (from push-forth I think) I had sort of set aside because I figured you'd think they were pretty weird. These take an entire stack and create an Enumerator out of that. enumerator_from_all_integers and so forth, I guess they would be here.

I'm all for this! While it might be sort of crazy, it seems like it could be very useful at times.

Vaguery commented 9 years ago

There still should be some discussion of what reduce and filter might be in a Push setting. Realizing that one doesn't want to do them "all at once" (because that's the Push way), I'm not entirely clear what makes sense.

The sort of canonical reducer is injecting + over a list, right? I'm not sure what it means to "reduce" a list by "applying" an arbitrary Push code item—with unknown arity—to even one item, let alone "combining". Even the map I've sketched above feels a bit like making a sandwich by carefully putting a slice of bread between every item....

Vaguery commented 9 years ago

I will postpone "looping" for the time being.

Vaguery commented 9 years ago

I notice a lot of the instruction names are pretty long. Can I get away with renaming them enum_thingie rather than enumerator_thingie, or is the naming standard to always use the complete stack name? Either way is good for me, just loads of typing. I understand enum is probably an overloaded string for many programmers, and it's not at all the same thing as an enumerator.

Vaguery commented 9 years ago

Quick judgment call: Should instructions like enumerator_map_code leave the code item in place on that stack after they've exhausted the enumerator, or pop it?

lspector commented 9 years ago

I haven't looked into the enumerators yet, but the general principle is to pop whatever you use. If it's straightforward to dup it first, so that you can easily get the effect of not popping when that's what you want, then I don't see a reason not to follow this general practice. However, that can sometimes get messy for things involving code and exec stack manipulations, and I don't immediately recall if we've already made exceptions for some of those.

Vaguery commented 9 years ago

That makes sense, and confirms my expectations.

So what it's doing now is leaving a copy of the top :code item until it deletes itself. This permits the same kind of mayhem code_do_... instructions do, since the code item "mapped" can change the state arbitrarily; it also permits the code item to rewind the enumerator, so more mayhem. But it doesn't unnecessarily duplicate the top code item.

The equivalent enumerator_apply instructions will pop the code or exec item when it's "applied".

thelmuth commented 9 years ago

I notice a lot of the instruction names are pretty long. Can I get away with renaming them enum_thingie rather than enumerator_thingie, or is the naming standard to always use the complete stack name? Either way is good for me, just loads of typing. I understand enum is probably an overloaded string for many programmers, and it's not at all the same thing as an enumerator.

If you go with enum_thingy, then I'd change the stack name to :enum. That would be fine with me, since we already have the abbreviated :exec stack, but I don't particularly care either way.

lspector commented 9 years ago

I don't think we should forbid abbreviation, and Tom's right that we already have exec. I do personally generally prefer to avoid abbreviations by default, but I agree that brevity is desirable. On this specific case I think that the overloading that @Vaguery mentions is real and worrisome. If I saw an enum stack and enum_* instructions (and I definitely agree with @thelmuth that stack names and instruction prefixes should agree), then I would definitely assume that they're for small finite sets of elements with symbolic names --- that is, enums I've seen elsewhere. I still haven't studied the enumerators being developed here, but I gather that the idea is quite different.

Vaguery commented 9 years ago

What C calls enum I would call categorical if I were to import my types to handle small symbolic sets....

lspector commented 9 years ago

If you shortened categorical to cat while also shortening enumerator to enum then you would guarantee maximal confusion :smiley:.

Vaguery commented 9 years ago

Oh no, I can guarantee a much higher level of confusion, just by pointing out that Clojure permits Unicode function names....

clojush.core=> (defn ℤ⊕ [a b] (+ a b))
#'clojush.core/ℤ⊕
clojush.core=> (ℤ⊕ 8 12)
20

Of course we would push that to the ℤ stack, with is clearly the integers....

lspector commented 9 years ago

But then you'd know you're confused. Confusion is maximal when it is unexpected!

Vaguery commented 9 years ago

$$c log(c)$$

lspector commented 9 years ago

Precisely, but quantifying the c you care about is always the rub.

Vaguery commented 9 years ago

OK, I think I have things squared away. In order to get a minimal viable set in folks' hands, I did not implement some tricky and odd things

However, the four higher-order functions are present:

Still not digging "apply" in those; really they're more like some other combinators I don't know the name of: they "apply" items from the source stack to the elements of the enumerator, consuming them as they progress. See the examples in the tests, which I worked out by hand just to make absolutely sure things were progressing as expected.

I've got very good test coverage on vector_integer, since that was the type I used in all the principle testing. I will spend some time in the next few days making sure there are equivalent and comprehensive tests for the other enumerator types, just in case there are some weird edge cases for strings or something.

Also, I have not yet refactored the "creator" code into a single higher-order definition. Will do that in a second pass after I have filed the rest of the automated tests, using the tests to confirm that the refactoring doesn't break everything.

Have a look at the codebase please; I've been attaching it to @NicMcPhee's fork (accidentally) since the other night when I committed my branch to his fork instead of my own, and never made the time to shift it over. It's not his fault!

Vaguery commented 9 years ago

(that said, I am just now about to do a run with this stuff turned on, so... soon)

Later: Yeah, no; I'm going to need some explicit help from somebody telling me how to set up pushgp to actually use

What has to happen? The existing setups make me think some things need to be invoked with require, others with use? Very confused.

lspector commented 9 years ago

Trying to catch up quickly on enumerators... just read some of the thread above... looks great.

I think Tom's more recent experience adding types will make it easier for him to give you the definitive list of things that have to happen. But a couple of notes:

thelmuth commented 9 years ago

Later: Yeah, no; I'm going to need some explicit help from somebody telling me how to set up pushgp to actually use

We generally use :use instead of :require. I think what you will need to do is the following:

I'll now take a look at the code and let you know if I spot anything else. Let me know how that goes.

thelmuth commented 9 years ago

Some questions:

Otherwise, looking really cool! I'm excited to give these a whirl.

Vaguery commented 9 years ago

@thelmuth

Vaguery commented 9 years ago

@thelmuth Let's talk more about proper map and other functional instructions. Here are the things I can determine, thinking about the instruction set we have already:

For numeric vector types, the basic arithmetic would be easy to implement as map or reduce functions. The question, in terms of language design, is how to recognize and obtain the particular instructions, without blindly proliferating a new suite of instructions: How would the process described by map integer_add over this vector, taking an integer as the second argument find the integer_add?

As I mentioned above, one approach would be to use metadata associated with the instructions themselves to facilitate this: What if a better map instruction looked at the code or exec stack, found an instruction as such, looked for an argument of the right kind to match the types of the instruction... that might work for a lot of them.

The trick here is the same one that interferes with (or complicates) matrix operations in Push: We need to choose between

  1. being willing to fail to find complementary arguments; for instance, we would need by chance to find an integer->integer function and a vector_integer accidentally to be next to each other on some (?) stack.
  2. being willing to "stage" the second argument to suit the first: pick a function rom the place they come from, and then examine its input type(s) to find the correct item(s) on the indicated stack.
  3. being willing to "scrounge" for second arguments to suit the type demanded by the first: if our function takes floats, we could pull the top thingie that is made of floats from a mixed source (this is like the matrix dimension constraint, where if we pick an $$m x n$$ matrix as the first argument, we would need to hunt for the first $$n x k$$ matrix to multiply it).

The first one seems like it's not especially helpful, and unlikely to be very evolvable. The second one breaks a fundamental paradigm of the instructions we've written so far, which is that all arguments are pulled first and then we do the math. The third one opens a can of worms that makes this more like push-forth, where this sort of thing is built in (because there is only one stack for every type).

One way of thinking about the second and third paths, I suppose, would be to build a kind of "map closure" or "continuation" object out of a map instruction and its first (instruction or code) argument, and then give that the behavior that push-forth uses to shuffle through the stack looking for a match.

Call this mythic instruction supermap_exec for now. Suppose it takes a single argument from the exec stack, which must be an instruction or code item. It pops the next item on exec, and if that item isn't "compatible" it swaps position with it in the exec stack. If the popped item is "mappable", it consumes it and becomes a map_closure item, still on exec. When a map_closure item is executed it pops the next item from the exec stack, and if it's the right type it applies the map function to it (one set of arguments taken from the stacks), and we're done. If it doesn't find a compatible second argument in the next position on exec or wherever, it swaps position with it.

The problem I've seen in push-forth is that these little continuation form gliders start crawling up the stack and consuming everything in turn. No instruction goes un-executed, but if there's a lot of mismatches there can be a large bolus of continuations drawing up the stack before you're done.

The only tricks are, this feels pretty un-Pushy, and it means there needs to be a lot more metadata associated with instructions (and possible code items).

Let me walk through an example for clarity:

exec:(supermap 2 4 [8 7 2] integer_mod 8) int:() int_vec:()
exec:(2  supermap 4 [8 7 2] integer_mod 8) int:() int_vec:()  ; supermap don't like "2"
exec:(supermap 4 [8 7 2] integer_mod 8) int:(2) int_vec:()  
exec:(4 supermap [8 7 2] integer_mod 8) int:(2) int_vec:()  ; supermap don't like "4"
exec:(supermap [8 7 2] integer_mod 8) int:(4 2) int_vec:() 
exec:([8 7 2] supermap integer_mod 8) int:(4 2) int_vec:() ; supermap don't like "[8 7 2]"
exec:(supermap integer_mod 8) int:(4 2) int_vec:([8 7 2]) ; supermap likes integer_mod!
exec:((map integer_mod SCALAR INT onto INT_VECTOR) 8) int:(4 2) int_vec:([8 7 2])  ; closure formed
exec:(8) int:(2) int_vec:([8%4 7%4 2%4])  ; closure arguments taken from stacks
exec:() int:(8 2) int_vec:([8%4 7%4 2%4])  ; boring old 8
thelmuth commented 9 years ago

Before reading too deep into your post, I think here's one way we're differing on these thoughts: You seem to want to ensure that the instruction(s) used as the mapping function output the right type and will find the right types. Let me come back to this idea in a second, when you get to [1].

I would personally go down another track entirely. Yes, I would somehow indicate the collection of the output of the map. Not sure exactly how -- there used to be a :types stack in Push that would be helpful here, but maybe there's a better way. You could have "baked-in" input and output types -- instructions like enumerator_map_integer_vector_to_string. Maybe this would result in "too many" enumerator instructions. We currently have ~6 enumerable types which would mean 6^2 = 36 such map instructions, which might cross the "too many" threshold.

Another idea that just came to mind is to somehow use epigenetic markers to denote the input/output types for higher-order functions. This might be overkill/weird, but it could also potentially be a nice way to only have 1 map instruction, but be able to have it define any input/output combinations.

[1] Ok, back to this thought. Personally, I'd lean toward letting evolution "figure it out" regarding the correct output types. You give the map instruction a block of code, and if it doesn't leave the correct output on the right stack, tough noogies. I don't know if the penalty would simply be a failing map instruction, or a shorter output vector, or what. Same with a filter instruction -- if there's no boolean on the :boolean stack, default to false and don't include that item. It seems like this way would avoid a lot of the complications that you're getting into. Now that I've read more, maybe this suggestion is like your (1) in your list?

Another problem I'm seeing with your suggestion is that you'd (well, I'd) often want to map a block of instructions, not just a single instruction. Your push-forthy ideas seem like they'd be even crazier to do with "more than one instruction functions".

Another crazy thought: what if we had a :function stack? The functions would be blocks of code with input and output types, and would execute in their own scope as to not disrupt the rest of the program. AHA! This actually sounds like something we already have, believe it or not -- the :environments stack, which is something @lspector and I were working on year(s) ago, with the purpose of being able to execute a block of code without disrupting other stack contents -- a local scope. We were doing it for other reasons (I think it was geometric semantic crossover?), but this could totally be co-oped for this purpose. I don't have time to think about the ramifications right now, but this could be a pretty elegant solution.

Ugh, I should be working on my defense talk rather than this, but this is so much more interesting. In fact, I should probably put off all of this stuff until after my defense. But, I don't want to put you on hold if now is when you can work on it. Anyway, if I don't respond for a while, that's why. :crying_cat_face:

thelmuth commented 9 years ago

On a more concrete note, I just realized that any instructions you define that you expect to use a parenthesized block of code from the :exec stack, such as enumerator_map_exec, need to have metadata indicating how many parenthesized blocks they require. See exec_if and exec_dup for examples.

Vaguery commented 9 years ago

Dude. DEFENSE.

Vaguery commented 9 years ago

Major clarification

I just was chatting with @NicMcPhee and I realize you might think I am considering this current branch for submission. This enumerator type was simply intended as a design spike for a more robust suite of types and instructions that actually do what we discover they should do. It's not going to be submitted for a Pull Request, and all the code will be thrown away! It's just a way of sketching in code, and eliciting points of "resistance" from the codebase and from you folks.

I quite like the idea of a proper higher-order function, along the lines Tom and I are sketching above. But that would be something I'd do in a test-driven way, and using a lot better refactoring techniques. Please recall this is the first Clojure (or Lisp or Java) code I've ever written.

Never use the first code!