lspector / Clojush

The Push programming language and the PushGP genetic programming system implemented in Clojure.
http://hampshire.edu/lspector/push.html
Eclipse Public License 1.0
331 stars 92 forks source link

Why does defining a "duplicate instruction" throw an error? #156

Open Vaguery opened 9 years ago

Vaguery commented 9 years ago

Specifically this line throws an exception when an instruction is registered with the same symbol as a previously registered one.

Unfortunately, this seems to be the breaking point whenever the entire Clojush system is subject to test: the instructions (identical, quite literally) a0 and a2 are defined in both clojush.problems.boolean.mux-11 and clojush.problems.boolean.mux-6, and others will also probably crop up if I get past this problem.

I would much rather be able to test the entire system all at once, rather than selectively turning on and off conflicting badly-separated modules and hoping they don't hide other interactions and conflicts. Should I

  1. change the instruction names in one or both of these conflicting problems
  2. change the "duplicate" detector so that it actually fires only when an instruction is redefined with different value, not (as in this case) when the same key is assigned the same exact function value
  3. adopt the behavior of most other Clojure libraries I've seen, in which the last-defined map item just wins, without raising a fuss at all (in other words, disable this error checking)

What is not immediately clear is why this throws an error at all. Any memories bubbling up?

Vaguery commented 9 years ago

Actually I've been thinking and I guess I'm even more confused now.

Looking at this whole sequence it seems as if there is a set called registered-instructions and also a completely independent map called instruction-table, both of which contain every instruction, but only the latter of which serves as the canonical function storage structure.

Is there any reason that registered-instructions isn't just a reference to (keys instruction-table)??

thelmuth commented 9 years ago

It looks like that check was added here: https://github.com/lspector/Clojush/commit/a014db549268d8508561b135409e61ec663b5feb

I don't remember exactly why, but it seems like a good idea. My guess is that something Really Bad happened because we were redeffing an instruction without meaning to. Besides for testing, I can't think of a reason to load more than one problem file at once. So, it would probably be preferable to find a workaround that works for testing but doesn't get rid of the check.

A solution you didn't mention would be to somehow namespace instructions that are defined in problem files. This seems like a more Clojure-y solution -- it allows the same instruction to be used in multiple files, but by requiring those files, the instruction would have to be namespace qualified. I'm not sure exactly how this would work, since define-registered is a macro and all.

As to your second post -- this is a good question, and I'm not sure what the answer is!

Vaguery commented 9 years ago

Actually, once the interpreter is a sequestered entity in its own right (and not a global pile of everything) this becomes moot, and it would be Really Weird if anybody actually redefined an instruction. The more desirable behavior, I suppose, is that you load only the instructions you want for that problem into the actual interpreter you run, and then you can query the interpreter state itself to do stuff like make random code.

Vaguery commented 9 years ago

I also note that if I completely delete all the problems, this trouble goes away :) While I'm not suggesting that, I am wondering now about a modular architecture in which Push (the interpreter), ClojushGP (the pile of search thingies) and the example problems are separated and tidy.

On a related note: several of the "problems" are actually running huge amounts of setup code immediately when they load. :/ Not encouraging.

NicMcPhee commented 9 years ago

I agree with @thelmuth's suggestion that instructions for different problems should probably be namespaced to avoid collisions. Here the instructions meant the same thing, but I could imagine a bunch of scenarios where things like move could mean wildly different things for evolving robot control vs. chess players.

I also like @Vaguery's idea of ultimately splitting the Push interpreter into a separate project from all the Research Stuff. My experience, however, is that kind of separation is tricky as "quick experiments" often bleed into the "beautiful core" (the Push interpreter in this case) in annoying ways. Here, however, it might be easier because the semantics of instructions can be specified fairly separately from the interpreter, which would be cool.

lspector commented 9 years ago

@thelmuth wrote: "My guess is that something Really Bad happened because we were redeffing an instruction without meaning to."

That's my vague memory too.

In the current code base there's a pretty deep assumption that only one problem file will be loaded at a time. I'm not sure how many things will break if we change that.

The idea of namespace qualifying instructions does seem like it would be the Clojure way, but I'm not sure it would play nice with all of the ways that we pass around lists of symbols. I've run into trouble with this sort of thing in other projects, where what I expect to be the same symbol isn't because I defined it in one place but quoted and passed it from another. My expectations for this probably stem from how symbols are used in Common Lisp and Scheme... Conceivably the right approach would be to use Clojure keywords, which don't have namespaces... but if the impulse here is to add namespace-qualification to allow multiple version of instructions to co-exist then this would be a step in the wrong direction.

lspector commented 9 years ago

@Vaguery wrote: "Is there any reason that registered-instructions isn't just a reference to (keys instruction-table)?"

There may be different a historical explanation, but the only reason I can think of for keeping it this way is that the lookups in registered-instructions might be faster. But I'm not confident that that's even true, given the optimizations that Clojure and the JVM can perform.

Vaguery commented 9 years ago

I was actually implying something like

(def instruction-table ... [it's a hash map]...)

(defn registered-instructions [table] (keys table))
lspector commented 9 years ago

(def instruction-table ... [it's a hash map]...)

Presumably it's actually an atom containing a hash map.

(defn registered-instructions [table] (keys table))

So you plan to pass it different tables? That's interesting... what's the intended use case?

I note that this means 2 function calls and whatever's required to compute keys for every check to see if an instruction is registered... maybe nothing to worry about, but I don't know.

Vaguery commented 9 years ago

@lspector:

In the current code base there's a pretty deep assumption that only one problem file will be loaded at a time. I'm not sure how many things will break if we change that.

Actually, my plan is that there can only ever be one problem definition file loaded, at least as far as any Interpreter object can see. That is, evaluating a program in the context of a problem definition will create a unique self-contained Interpreter environment (possibly with just plain default instruction sets and all), will run that in its own self-contained thread, and that then it will either disappear or be updated or be reinitialized with the next script or training case.

In other words, I'm not at all trying to "load more than one", but rather to enforce this restriction at the appropriate point of responsibility: the entity which creates the interpreter itself, not the user or developer as such. As things stand now, with all the examples piled on top of one another in the codebase, it's impossible to actually test that code base comprehensively. And this particular test, locked away deep inside the inner (essentially private) behavior of the interpreter setup, doesn't let that happen.

lspector commented 9 years ago

@Vaguery:

The more desirable behavior, I suppose, is that you load only the instructions you want for that problem into the actual interpreter you run, and then you can query the interpreter state itself to do stuff like make random code.

Sometimes you want to include literals in random code that aren't in the instruction set and/or to customize the probabilities for different instructions. This is done via the atom-generators argument in pushgp, Even for different applications (search methods, etc) I suspect that we'll want finer grained control over random code than just uniform sampling from all of the available instructions.

lspector commented 9 years ago

In other words, I'm not at all trying to "load more than one", but rather to enforce this restriction at the appropriate point of responsibility: the entity which creates the interpreter itself, not the user or developer as such.

Sounds very reasonable in principle! But I'm not sure what it means, concretely, for the point of responsibility to change, or what implementation you have in mind. Right now an exception is raised if a problem file (or any file) is loaded that re-defines an already-defined instruction. But maybe that's not the right time/place to raise the exception, and it wouldn't complain if you load two problem files that don't define clashing instructions. Both would presumably define argmap, but these vars would be in different namespaces (unlike Push instructions), and I'm not sure if that's a concern or not.

Vaguery commented 9 years ago

Working on this now, and I just need to understand the code better so I don't break anything subtle you've cooked in somewhere else:

So as things stand now, the symbol for each registered instruction is added to the global set registered-instructions with a uniqueness check that throws an error if an instruction with a duplicate name is registered. Then, assuming that doesn't throw an exception, the global hash-map atom instruction-table is updated to include a new pair, with a key set to the (symbol) name of the instruction, and the associated function definition as its value.

This doesn't change in the course of a run, right? I mean, the set of instructions doesn't update in the course of any given simulation, does it?

In addition, various other things important for the Plush representation and random code generation are stored in instruction-table in the metadata of the individual key values, such as the stacks the instruction pulls from (or possibly pushes to), the number of parentheses associated with that instruction in the Plush representation, and maybe some type hints?

OK, so if I understand this correctly, these are used in the following way in the current codebase:

Vaguery commented 9 years ago

@lspector I think maybe it's not clear, there will be no global argmap when I'm done. There will be a stateful Interpreter, which has all of the state information contained inside it. So no functionality will be lost, and things like adding literals to the instruction set will happen at the level of the Interpreter's internal state, not a global.

Vaguery commented 9 years ago

In any case: I will make absolutely sure that if somebody redefines an instruction (in the context of the Interpreter they're setting up) with the same name, an exception is thrown.

Is there a need for somebody to disable a default (say "core") instruction that might be turned on by default? If so, should they be able to add a new one in their problem definition that replaces the default functionality with something they want instead? Say for instance they are interested in using Clojure's + operator for integer addition, rather than the type-safe + one in integer_add?

lspector commented 9 years ago

@Vaguery wrote:

This doesn't change in the course of a run, right? I mean, the set of instructions doesn't update in the course of any given simulation, does it?

I don't think that any existing code updates the set of instructions dynamically. It might be nice to allow it, but I don't think anyone has assumed that you can do it. I think it'd be fine to assert that it's forbidden for now, and later to make an explicit mechanism for doing it if we decide to allow it.

Vaguery commented 9 years ago

One final question: Is there any architectural reason the instruction set is in an atom, and not just defined in the shared namespace?

lspector commented 9 years ago

@Vaguery wrote:

One final question: Is there any architectural reason the instruction set is in an atom at all?

So that instructions can be added incrementally, in different files (including problem files).

Vaguery commented 9 years ago

OK so now I am confused: Doesn't (assoc hash-map key new-value) update it without the persistent mutable state? I mean we're not changing it during a run, and every interpreter from a given problem definition should arguably refer to the exact same table, so there don't seem to be concurrency issues, so…?

What am I missing?

Vaguery commented 9 years ago

OK so that all just crystallized for me as I drive to pick Barbara up: I'll trim out this particular issue as part of a larger suite I've been wrestling with. Then what I'm going to do is pull instructions , including all the metadata and information for representations, into records. Defining one will be much more explicit (and won't involve a macro), everything will be consistent and well-surfaced, and it'll be at least as easy to add new ones in problem definitions.

lspector commented 9 years ago

OK so that all just crystallized for me

I'll have to see more before I know if it's a dilithium crystal or a dark crystal...

OK so now I am confused: Doesn't (assoc hash-map key new-value) update it without the persistent mutable state?

That call to assoc returns an updated hash-map but doesn't, as you note, persistently update anything. So if you want to have separate expressions in separate files all contributing to the table then you have to update an atom... as far as I can see.

Vaguery commented 9 years ago

@lspector

I'll have to see more before I know if it's a dilithium crystal or a dark crystal...

Lee, I'm really trying to help this project become usable by people who don't already know everything about it. It's taken me six months to learn enough about it to be able to propose a consistent architecture that's extensible and maintainable, but then again I'm not a very good programmer. If you're worried I'm going to break something, I understand and I can just fork it and not bother checking with you. But I value your insights into how it has been to teach with it, and how easy (or hard) it has been to extend it and maintain it so far.

Vaguery commented 9 years ago

That call to assoc returns an updated hash-map but doesn't, as you note, persistently update anything. So if you want to have separate expressions in separate files all contributing to the table then you have to update an atom... as far as I can see.

Right. I think I see it now: You're saying it's because there's nothing wrapping the state of the hash-map, and so there's no calling entity to maintain it. It's just updated by the loading process in a piecewise imperative way, and there's nothing to "catch" the updated state except the atom itself.

I'm thinking of it as being encapsulated inside a persistent object of some sort (an Interpreter, for instance), and therefore imagining that a hypothetical register-instruction would return the updated Interpreter record to the calling entity. In that case, it doesn't need the mutable state because some other, surrounding process (like the problem file itself) is ultimately responsible for building and preserving the resulting state.

lspector commented 9 years ago

@Vaguery wrote: "I can just fork it" -- Noooooooooo! I absolutely super value what you're doing and want it in the master branch. Just joking about the crystals, trying to say that I didn't yet understand that part.

lspector commented 9 years ago

@Vaguery wrote: "I'm thinking of it as being encapsulated inside a persistent object..." -- Do you have a sense of how the kinds/amounts of changes this will require in existing instruction definitions and problem files?

Vaguery commented 9 years ago

LOL. You easy to tease.

OK, so only a worse programmer than I am would actually change your codebase for something this dramatic. I've started a strongly-tested branch that is building out an Instruction record type now, and will build the accompanying Interpreter record shortly. Basically these are aimed at changing the instruction declaration syntax by making it simpler (by automating away repeated code in the current system, like argument checks and dependencies) but also much more explicit (by requiring doc strings and argument types).

Since the driving entity is clearly the problem definition file, I'll just assume that 100% of the current examples and experiments should remain untouched until somebody checks and finds out how many already fail to run as things are right now (hint: it ain't zero).

Rather I'll build a parallel library in place that tries as much as possible to duplicate the functionality of the current language. The syntax for setting up Push interpreters will be very similar, and of course the interpreter loop itself will be almost identical, but the instruction "registry" will be integrated into an Interpreter object, and the syntax for defining instructions will be much simpler and abstract.

So I guess the answer is: barely detectable for anybody who wants or needs to use any current stuff, but very different in the new parts, so that everybody has a chance to try out the new affordances.

I can't see a way to safely automate or incorporate the current untested instructions, and to be honest the process of writing tests for them is fraught because of all the overlapping argmaps and stuff. So I'll point to the tests for the new integer_add instruction as the warrant for it's correctness, but those tests (since they necessarily exercise the code) can't be used for the old one.

Bottom line: plan for migration and transition.

Vaguery commented 9 years ago

Clarification (can't edit these on phone): Basically the use case for Clojush remains exactly the same in both schemes. The user (student, researcher, whoever) makes a new settings file that specifies what instructions, variables, data to bind to those variables, simulators to link in, how long to run the interpreter, and so forth. But no GP stuff should remain in that specification: this is just to set up an Interpreter and run it.

Another "outer" layer, basically "search", will have access to that setup system but the point is when a GP process says "what's this program's state after running?" this module will run that program and pass the state back. No scoring, no crossover, no representation, just Push code and variable bindings in, Push state out.

lspector commented 9 years ago

@Vaguery wrote:

LOL. You easy to tease.

Demonstrably. So easy that it shouldn't even be fun :-).

The plan sounds good!

Vaguery commented 9 years ago

Actually, to follow up with a real keyboard: There's no substantial difference in the resulting architecture at the level I can see from here. An "Interpreter" is almost exactly equivalent to a self-contained push-state plus the stuff currently in interpreter.clj, with a bit of extra syntactic stuff Clojure provides to make it more efficient, and also with tests and opinionated convenience functions to help ensure it doesn't go haywire if somebody wants to add a new type or instruction for their problem.

lspector commented 9 years ago

Bringing the currently global stuff into push-states sounds like the right idea to me too, and putting it that way helps me to see what you have in mind.

NicMcPhee commented 9 years ago

So if you want to have separate expressions in separate files all contributing to the table then you have to update an atom... as far as I can see.

You could have everyone do something gross like (def m (assoc m x y)) every time you add something to the instruction set, but an atom is (I think) a much more Clojure-y way to handle that changing state. An alternative would be ref and ref-set, but I think that swap! is prettier than transactions.

Ultimately, though, your plan to move that state into the interpreter sounds better than any of the above. In some ways the "need" for the atom there indicated the presence of global state that ultimately wants to go away.

lspector commented 9 years ago

You could have everyone do something gross like (def m (assoc m x y)) every time you add something to the instruction set, but an atom is (I think) a much more Clojure-y way to handle that changing state.

Re-deffing would definitely be a no-no with respect to Clojure style, and unless you explicitly declare it dynamic it may not even work (depending on Clojure version).

An alternative would be ref and ref-set, but I think that swap! is prettier than transactions.

I think that transactions would only be called for if we had to change multiple global things in a coordinated way. I think of them when you have to do something analogous to moving money between two accounts in a bank, where each account is separate but both have to change at the same time.

As long as we want a global instruction table, and we want separate expressions in separate places to augment the table, I'm pretty sure that an atom is the best solution.

Ultimately, though, your plan to move that state into the interpreter sounds better than any of the above. In some ways the "need" for the atom there indicated the presence of global state that ultimately wants to go away.

I'm still not 100% sure what this would look like, but I imagine that you would create an interpreter and then pass it through lots of functions, each of which would return the interpreter with an additional instruction added to the interpreter's instruction table, and then you'd use the resulting interpreter to run your program. But when would you do all of this instruction-table building? If you do it every time you want to run a Push program then that sounds slow and cumbersome to me. If you do it just once when you load a problem and stick it in a global then I'm not sure we've gained much over the current approach. I can imagine an intermediate approach where you stick it in a let-bound local, but again, I'm not 100% sure what this would look like.

In the current system there is something local-ish that's sort of like an instruction table: the atom-generators in calls to pushgp. This is used for random code generation. The idea (maybe not a good one, ultimately) is that the vocabulary for random code generation has to be problem-specific, but that the instruction table can be global and contain everything; it just won't see a lot of what it contains in particular applications. Where this runs into trouble, of course, is where there are name conflicts. The current idea (again, maybe not a good one, ultimately), is that we can keep all non-problem-specitic instructions conflict free, and not worry about conflicting instruction names in problem files because only one of those will be loaded at a time.

lspector commented 9 years ago

P.S. Note that atom-generators in pushgp are not actually local instruction tables, both because they don't associate names to implementations and because they can contain literals and functions that are called at code creation time (which is how we implement ephemeral random constants for genetic programming). But they do provide a vocabulary for programs that's local to a problem.

NicMcPhee commented 9 years ago

As long as we want a global instruction table, and we want separate expressions in separate places to augment the table, I'm pretty sure that an atom is the best solution.

I agree entirely. I was just walking through the alternatives, but not as coherently as you did.

I'm speaking for @Vaguery here, but I think the plan would be to build just one interpreter at the start of a run that can be re-used throughout. So one would build it up (including loading up the relevant instruction definitions), and then make it available at the point where scripts need to be interpreted. That could involve passing it through a lot of functions, especially at the beginning, but I think some additional refactoring should be able to clean that up some.

saulshanabrook commented 9 years ago

@Vaguery said much earlier:

The more desirable behavior, I suppose, is that you load only the instructions you want for that problem into the actual interpreter you run, and then you can query the interpreter state itself to do stuff like make random code.

That also makes the most sense to me, to remove as much global state as possible and pass a list of instructions to the executor, in some form or another.

@Vaguery is this the direction you are pursuing, or are you refactoring in a different way?

Vaguery commented 9 years ago

Yes, @saulshanabrook: Each Interpreter will have a self-contained hash of instructions.

saulshanabrook commented 9 years ago

@Vaguery :+1: So no more register-instruction?

saulshanabrook commented 9 years ago

@Vaguery Any timeline on these changes? I am hoping to refactor the logging and if this is going to land soon, I will wait on it before starting.

Vaguery commented 9 years ago

Not soon. Maybe two weeks, if I stay healthy and have enough money to keep the heat on.