plumatic / plumbing

Prismatic's Clojure(Script) utility belt
1.49k stars 108 forks source link

Plans on full support of nested calculations? #41

Open guv opened 10 years ago

guv commented 10 years ago

I have scenario where I have a root graph for the main calculation and a subgraph that needs to be applied to a collection of input maps. Currently, this is implemented via two different graphs together with a function for applying the subgraph to each input map. The mentioned collection of input maps will not fit into memory in general.

But that way I need to manage dependencies on properties of the subgraph in the root graph myself. It would be nice if that could be done with prismatic.graph.

Are there any plans to wnable something like that? Or is this already possible with the current implementation using some implementation pattern?

w01fe commented 10 years ago

Currently the semantics of Graph are that each function is called (at most) once to produce a value, so the way you're doing it now is how to do it.

I guess it's possible to consider adding subgraphs that 'map' over a set of values, but I'm not sure how this could be represented declaratively. We currently don't have any plans to add it, but if you have ideas I'd be interested to hear them.

On Thu, Jun 12, 2014 at 3:57 AM, Gunnar Völkel notifications@github.com wrote:

I have scenario where I have a root graph for the main calculation and a subgraph that needs to be applied to a collection of input maps. Currently, this is implemented via two different graphs together with a function for applying the subgraph to each input map.

But that way I need to manage dependencies on properties of the subgraph in the root graph myself. It would be nice if that could be done with prismatic.graph.

Are there any plans to wnable something like that? Or is this already possible with the current implementation using some implementation pattern?

— Reply to this email directly or view it on GitHub https://github.com/Prismatic/plumbing/issues/41.

saulshanabrook commented 8 years ago

I am also looking into (what I think is) the same thing. From the mailing list:

As such, it isn't really suited to model more dynamic computations where based on the data you may do different things, or call a function multiple times. It may still be useful inside or outside those computations, but the dynamic part itself can't be represented within graph currently.

saulshanabrook commented 8 years ago

I have implemented this type of calculation using the current graph library, but am evaluating alternative ways of implementing it. I will go over:

  1. The high level goals I see for subgraphs
  2. My specific use case
  3. My current solution
  4. Other solutions I am considering

Any feedback/suggestions would be greatly appreciated. This type of discussion seems better suited to the message board, but I like the UI of Github Issues better, so I am posting it here. If you would prefer though, I am happy to move the discussion over.

Subgraphs

Subgraphs are nested calculations that will be run multiple times. They get some of their values from the parent graphs and others each time they are evaluated.

My Use Case

I am creating a general purpose metaheuristic search framework. Primarily, it is intended for research into genetic programming.

The whole point of the framework is to be able to be able to modularly assemble a search technique (that includes the algorithm, the problem, and many configuration parameters) in order to run it and look at the results.

The results are a sequence of Generations. Below is summary of the schemas:

(def Traits
  "Traits are any information we know about an individual or a generation. For example
   this could be how well it does on a certain test case or the best individual in the current generation."
  { s/Any  (s/maybe s/Num)})

(def Individual
  "One individual in the population"
  {:genome s/Any
   :id s/Str
   :traits Traits
   :parent-ids #{s/Str}})

(def Generation
  "Holds the whole state for a current generation of individuals."
  {:index s/Int
   :traits Traits
   :individuals #{Individual}})

The whole process is run by a graph that has a :generations keys. The configuration problem is how to assemble that graph from a bunch of different. Then to actually get the result, we just run the graph and pull off the :generations key.

To compute each Generation, you need the last Generation (as well as all the search configuration). So computing the next generation from the previous is a problem that could be tackled by a subgraph. Also, each :genome and :parent-ids for the next solution is generated from the previous generation's individuals. The :traits are in dependent only on the :genome of the Individual, so this would also make sense as a subgraph.

Goals

While getting nested calculations to work is a necessity, I have some other goals that shape how I evaluate possible solutions

  1. Easy to test each individual calculation in isolation
  2. Can wrap each calculation in profiling or logging, by wrapping the whole graph
  3. Can parallelize calculations
  4. Able to many partially defined graphs to get the final graph
  5. Don't compute extra nested calculations if they aren't necessary
  6. Have parts of nested computation depend on other parts of the calculation
  7. Produce visual representations of the execution flow in an automated manner
  8. Works with existing plumbing Graph tools (compiling, analysis)

Solutions

... continued below

saulshanabrook commented 8 years ago

Solution 1: Functions at nodes (currently implemented)

Our current approach is based on using functions as many of the graph values. So they are defnks that return fns. This let's us split of the computation of a node, into the parts that are constant throughout the search and parts that are dependent on the invocation.

For example, we often want to stop searching after a certain fixed number of generations. So the max-generations->done? fnk takes in a max-generations from the graph and returns a function that takes in the current generation and returns true or false if the index of that generation is past the max-generations. So then if we set our :done? key in our graph to be max-generations->done? we can modify the max-generations key on the parent graph to change this config value. The core search function then can just be a fnk that takes in a done? key and uses that function to check if we are done.

Examples psuedocode for how thise works graph:

:generations (fnk [done? step initial] (take-until done? (recur step initial))
:done? (fnk [max-generations] (fn [gen] (> max-geneartions (:index gen))
:initial (fnk [->genome population-size]) (repeatedly population-size ->genome))
:step (fnk [select tweak population-size] (fn [generation] (->make-new-generation generation (-> generations select tweak))

Goals

1._Easy to test each individual calculation in isolation

:+1: We can test each node easily, we just call it once with the top level config parameters to produce the function then again.

(is (not ((max-generations->done? {:max-generations 10}) (c/complete {:index :1} Generation))))
(is ((max-generations->done? {:max-generations 1}) (c/complete {:index :1} Generation)))

2._Can wrap each calculation in profiling or logging, by wrapping the whole graph

:ok_hand: We can wrap each node to check if it returns a function. If it does, we can wrap the function it returns in the required code. This is more complex than wrapping each node directly, but it does work.

3._Can parallelize calculations

:ok_hand: All the parallelism has to take place on a case by case basis inside the functions (example using pmap).

4._Able to many partially defined graphs to get the final graph

:+1: To combine graphs, we can just use the g/graph function to nest them linearly. Since they are all at the top level, we don't have to do any merging of keys or graphs. Genetic algorithm example:

  (g/graph
    :population-size (fnk [] 1000)
    :initial (g/instance initial/->genome-> [population-size] {:n population-size})
    :evaluate evaluate/genome->traits->
    step/graph
    :generations base/generations))

5._Don't compute extra nested calculations if they aren't necessary

:-1: Lazy graph computation doesn't help us here, because the computation actually happens when calling the function. What I mean is that, in each generation, all the individuals and traits for those individuals are calculated fully. But we might not need to calculate a certain trait for an individual, if it doesn't get selected for another reason.

6._Have parts of nested computation depend on other parts of the calculation

:-1: It isn't possible to have :traits "depend" on :genome of an individual explicitly. It has to take place inside a function that takes in a whole generation and does a mapping. Also, we want to calculate the :best-individual in the generation, and also maybe the :best-traits which are just the :traits of the :best-individual. That is currently not easy, because the :best-individual isn't a node, it's a value that is calculated. So we have to calculate it at at the same time as the :best-traits. What I mean is that the :traits of a generation might look like this:

{:best-individual ...
 :best-traits ...
 :average-traits ...
}

With this strategy, all of those would have to be calculated by one function that produced the total :traits of the generation. It wouldn't be easy to mix in those piece by piece in the configuration.

7._Produce visual representations of the execution flow in an automated manner

:ok_hand: We should be able to produce graphs of what functions require one another, but we don't get much of a sense exactly how each inner computation works.

8._Works with existing plumbing Graph tools (compiling, analysis)

:ok_hand: It is just a normal graph, so compiling it is no problem and works fine with any method. However, things like lazy graphs aren't helpful in this case, because it isn't the evaluation of nodes that takes time, but the execution of the function they return.

w01fe commented 8 years ago

Thanks for the detailed write-up.

A possible alternative within the current framework would be to use two graphs:

The two graphs could share input parameters, etc.

saulshanabrook commented 8 years ago

Yeah I have been thinking about that approach a lot in the past couple of days.

Nested fnks as leaves

I was actually thinking of nesting the graphs in another like this:

(g/graph
  :gen->gen-graph
    (g/graph
      :->ind-graph
        (g/graph
          :parents (fnk [] (fnk [->parent] (fnk [[:tweak n-parents]] (repeatedly n-parents ->parent))))
          :tweak (fnk [all-tweaks] (fnk [] (fnk [] (rand-nth all-tweaks))))
          :->testcase-graph
            (g/graph
              :calc-y (fnk [evaluate] (fnk [] (fnk [genome] (fnk [x] (evaluate genome x)))))
              :abs-difference (fnk [] (fnk [] (fnk [] (fnk [calc-y actual-y] (Math/abs (- calc-y actual-y))))))
              :abs-difference-sq (fnk [] (fnk [] (fnk [] (fnk [abs-difference] (* abs-difference abs-difference))))))
          :->testcase (fnk [] (fnk [] (fnk [->testcase-graph] (fn [x y] (g/run ->testcase-graph {:x x :actual-y y})))))
          :individual
          {:id (fnk [] (fnk [] (fnk [] (uuid))))
           :parent-ids (fnk [] (fnk [] (fnk [parents] (map :id parents))))
           :genome (fnk [] (fnk [] (fnk [tweak parents] (apply tweak parents))))
           :traits {:program-size (fnk [] (fnk [] (fnk [genome] (count genome))))
                    :testcase-differences (fnk [testcases] (fnk [] (fnk [->testcase] (map (comp :abs-difference ->testcase) (seq testcases)))))}})
      :->ind (fnk [] (fnk [->ind-graph] (fn [] (->> {}
                                                (g/run ->ind-graph {})
                                                :individual))))
      :->parent (fnk [] (fnk [[:prev-gen individuals]] (partial rand-nth individuals)))
      :generation
       {:index (fnk [] (fnk [[:prev-gen index]] (inc index)))
        :uuid (fnk [] (fnk [] (uuid)))
        :individuals (fnk [population-size] (fnk [->ind] (repeatedly population-size ->ind)))})
  :gen->gen (fnk [gen->gen-graph] (fn [gen] (->> {:prev-gen :gen}
                                              (g/run gen->gen-graph)
                                              :generation)))
  :generations (fnk [gen->gen initial] (iterate gen->gen initial))
  :testcases (fnk [test-fn testcase-xs] (zipmap testcase-xs (map (test-fn testcase-xs)))))

That way any nested graph can pull whatever it needs from parent graphs. This doesn't require any changes to the plumbing libraries, since each time it compiles a graph it should unwrap one of the fnks. Also it is just one huge map. It would be a bit harder to wrap each nested fnk, but I don't see it being impossible.

One advantage of this approach is that it makes clear where each value is unique to. For example the ->parent function is determined in each generation, so it is in the key at that level.

This however becomes pretty darn verbose.

saulshanabrook commented 8 years ago

Return subgraphs from leaves

We could also do it by not nesting each fnk but instead nesting each subgraph itself in a fnk.

(g/graph
  :testcases (fnk [test-fn testcase-xs] (zipmap testcase-xs (map (test-fn testcase-xs))))
  :gen->gen-graph
    (fnk [population-size all-tweaks actual-value evaluate testcases evaluate]
      (g/graph
        :->ind-graph
          (fnk [all-tweaks actual-value evaluate individuals testcases evaluate]
            (g/graph
              :parents (fnk [->parent [:tweak n-parents]] (repeatedly n-parents ->parent))
              :tweak (fnk [all-tweaks] (fnk [] (fnk [] (rand-nth all-tweaks))))
              :->parent (fnk [[:prev-gen individuals]] (partial rand-nth individuals))
              :->testcase-graph
                (fnk [evaluate [:individual genome]]
                  (g/graph
                    :calc-y (fnk [x] (evaluate genome x))
                    :abs-difference (fnk [calc-y actual-y] (Math/abs (- calc-y actual-y)))
                    :abs-difference-sq (fnk [abs-difference] (* abs-difference abs-difference))))
              :->testcase (fnk [->testcase-graph] (fn [x y] (g/run ->testcase-graph {:x x :actual-y y})))
              :individual
              {:id (fnk [] (uuid))
               :parent-ids (fnk [parents] (map :id parents))
               :genome (fnk [tweak parents] (apply tweak parents))
               :traits {:program-size (fnk [genome] (count genome))
                        :testcase-differences (fnk [->testcase] (map (comp :abs-difference ->testcase) (seq testcases)))}}))
        :->ind (fnk []
                (fn []
                  (->> {}
                    (g/run ->ind-graph)
                    :individual)))
        :generation
         {:index (fnk [[:prev-gen index]] (inc index))
          :uuid (fnk [] (uuid))
          :individuals (fnk [population-size ->ind] (repeatedly population-size ->ind))}))
  :gen->gen (fnk [gen->gen-graph]
              (fn [gen]
                (->> {:prev-gen :gen}
                  (g/run gen->gen-graph)
                  :generation)))
  :generations (fnk [gen->gen initial] (iterate gen->gen initial)))

While this is a bit less verbose, it also requires specifying all the required arguments from a higher level as a inputs to each graph, which isn't DRY. What I mean is that if I had a top level key, like mutation-rate, and I need it in the 3rd level down subgraph, then I have to write mutation-rate in each of the fnks defining each of those subgraphs.

It is also not easily composable, because it isn't a large map anymore, but a bunch of nested fnks. So if I wanted to add a new computation inside the :traits map, this would be pretty difficult. I couldn't just do assoc-in, I would have to wrap each of the fnks, and this seems intractably complicated.

saulshanabrook commented 8 years ago

Partially evaluated fnks

It would be ideal to be able to have non nested fnks and still be able to keep the things as one large map, like this:

(g/graph
  :testcases (fnk [test-fn testcase-xs] (zipmap testcase-xs (map (test-fn testcase-xs))))
  :gen->gen-graph
    (g/graph
      :->ind-graph
        (g/graph
          :parents (fnk [->parent [[:tweak n-parents]] (repeatedly n-parents ->parent)])
          :tweak (fnk [all-tweaks] (rand-nth all-tweaks))
          :->testcase-graph
            (g/graph
              :calc-y (fnk [evaluate genome x] (evaluate genome x))
              :abs-difference (fnk [calc-y actual-y] (Math/abs (- calc-y actual-y)))
              :abs-difference-sq (fnk [abs-difference] (* abs-difference abs-difference)))
          :->testcase (fnk [->testcase-graph] (fn [x y] (g/run ->testcase-graph {:x x :actual-y y})))
          :individual
          {:id (fnk [] (uuid))
           :parent-ids (fnk [parents] (map :id parents))
           :genome (fnk [tweak parents] (apply tweak parents))
           :traits {:program-size (fnk [genome] (count genome))
                    :testcase-differences (fnk [testcases ->testcase] (map (comp :abs-difference ->testcase) (seq testcases)))}})
      :->ind (fnk [->ind-graph] (fn [] (->> {}
                                        (g/run ->ind-graph)
                                        :individual)))
      :->parent (fnk [[:prev-gen individuals]] (partial rand-nth individuals))
      :generation
       {:index (fnk [[:prev-gen index]] (inc index))
        :uuid (fnk [] (uuid))
        :individuals (fnk [population-size ->ind] (repeatedly population-size ->ind))})
  :gen->gen (fnk [gen->gen-graph] (fn [gen] (->> {:prev-gen :gen}
                                              (g/run gen->gen-graph)
                                              :generation)))
  :generations (fnk [gen->gen initial] (iterate gen->gen initial)))

In order to support running this, I think only two things would need to be changed:

  1. When executing a graph, if you the values for a key aren't available, then leave it as partially evaluated.
  2. If you encounter a value in a leaf of a graph that isn't a fnk or a map, treat it as a (fnk [] v).

With these two changes in place, the top level run of the graph would basically fill in the fnks with all the keys it has, leaving those that still need keys unevaluated. Then as each bit of it get's subgraphs, it can fill them in bit by bit.

We lose some safety however, with these relaxations. We can't verify as much at graph compile time.

w01fe commented 8 years ago

Thanks for the details. I'm not sure if this is the full graph, but my gut feeling from looking at it is that you don't really need the outer layer to be a graph. And if you just make that part an ordinary function, then I think all the complexity goes away? Or if not, I still think you'd be better off having them be two separate graphs. First you compile the generation one, and pass it as a parameter to the driver. (Graph isn't meant to model all your logic and computations, just those where the added complexity pays off.)

My feeling is that Graph has a lot of intrinsic "magic" in how data moves around, and so I've tried to eliminate any unnecessary magic and make things as safe as possible while preserving that core. That's one reason, for instance, fnks have to explicitly name the parameters then want and can't just say (fnk [& everything]). With that in mind, my gut feeling is that the changes you propose would cross the boundary for "too much magic" to put in core, and could cause some very confusing errors for people who didn't completely understand the semantics.

saulshanabrook commented 8 years ago

@w01fe Yeah I agree that is is too magic/complex to be in core.

Maybe the problem is that I am depending on Graph for more than computation. It also serves as a way to compose parts of a solutions easily. That probably isn't what you built graph for. For me details check out how I turn a bunch of graph pieces into a search.

I have been thinking about my last example a bunch in the last few days. Trying to figure out how to write fnks like that but still be able to not add too much magic. I figured out how to turn the one with a bunch of fnks that all just list what they need into a graph that can be compiled by the current library.

Basically be able to compile https://github.com/plumatic/plumbing/issues/41#issuecomment-203610483 into https://github.com/plumatic/plumbing/issues/41#issuecomment-203600661 if we just give it one more piece of information. This would be somehow providing in each a subgraph the inputs that are given to it when it is run. For example for gen->gen-graph this would be prev-gen. If we can specify that in the graph, then we can compile it down to a bunch of nested fnks.

For example this:

{
:nested-graph {
:graph-inputs (fnk [] [:input]) // `graph-inputs` is a special key now
:x (fnk [outer inner] (+ outer inner)
:inner (fnk [input] (inc input) 
}
:outer (fnk [] 1)
}

would be precompiled into this:

{
:nested-graph {
:x (fnk [outer] (fnk [inner] (+ outer inner))
:inner (fnk [] (fnk [input] (inc input))
}
:outer (fnk [] 1)
}

Do you know what I mean? I might write a separate library for that. Maybe the current way of just using functions is simpler, but I am curious how this way plays out in full.

The advantages I see of this approach is that it keeps all the current safety of graphs and will even let you know if you forget to define a key in subgraphs when you compile the top graph. I think I would do this by starting at the leaf sub graphs and checking for each leaf if there are any keys it requires that aren't outputted by that graph or by the graph-inputs and bubble them up to the input to the fnk that we wrap them in.

We could then wrap all the inner nodes easily (for profiling etc) by using the un-precompiled graph.

w01fe commented 8 years ago

Cool. I kind-of follow, but haven't taken the time to fully understand the example and motivations behind the pieces. If you end up making something, I'd definitely be interested to take a look -- thanks!