nathanmarz / cascalog

Data processing on Hadoop without the hassle.
Other
1.38k stars 179 forks source link

prune operation if it isn't needed #247

Closed dollschasingmen closed 9 years ago

Quantisan commented 10 years ago

@dollschasingmen A couple questions to help me understand please. What's the gain in doing this? Is there any assumption being made in the pruning?

dollschasingmen commented 10 years ago

main use case:

(def my-query [outvars] (<- outvars (my-gen !a !b) (pred1 !a !b :> c) (pred2 !a b :> !d))

then suppose you call it like this:

(my-query ["!a])

without the pruning code, both pred1 and pred2 would be called. with the pruning code only pred1 would be called. so you can use it to dynamically only request the functions that you need to run in the query. should be more efficient this way (since you just skip calling code that where the output is ignored). in practice, this is best if pred2, for example, is really slow/heavy.

so tl:dr, it only runs code that is actually needed.

assumptions - in theory none :) the main approach of this is to check if the output of a particular predicate is necessary -- aka, is the output var being used somewhere else.

there are a few exceptions to this -- which the test document.

sritchie commented 9 years ago

@dollschasingmen, sorry I never really looked at this. I know it's been a long, long time.

One thing we need to guarantee is that we include filtering variables in the set of variables that we protect, not just the final projection. a query like

(<- [?x]
  (src ?x)
  (* ?x ?x :> ?square)
  (even? ?square))

Needs to return only even squares even though the filter variable's removed. Does that make sense? Let me know if you're still interested here and I can give you a super-fast review turnaround time. Thanks again!

dollschasingmen commented 9 years ago

@sritchie, np, thanks for not forgetting this PR :)

This function doesn't prune filter predicates under any circumstances. The pruning action is limited to only non-filter and non-generator predicates. I added a few more tests to assert that (see latest commits).

Sorry, the function is a bit convoluted. It has been a long time and reading through it was not super simple even for me, ha. Not sure how to make it the logic more obvious/readable (suggestions very much appreciated here), but I re-wrote the function and test descriptions to make it all a bit more clear (hopefully?)

sritchie commented 9 years ago

This looks great. Currently compilation's forced at the subquery boundary, but I can see this advancing down the road to actually filter down across subqueries when the user does some projection like (src ?x _) when using a query.

I'm happy to merge this now, but I do have one more Q / suggestion; I might be wrong here, but reading the code it looks like this optimization might not be able to handle chains of operations that end in some unused variable, like

(<- [?x]
  (src ?x)
  (* ?x ?x :> ?square)
  (dec ?square :> ?square-minus-one))

I bet the easiest way to deal with this would be to run the same optimization you're currently running, but to pass it through a fixed-point function like

(defn fixed-point [f guess]
  (let [next (f guess)]
    (if (= guess next)
      next
      (fixed-point f next))))

and cull all operations backwards. Up to you if you think this is a good idea or worth doing in this pull request. I always feel lame asking for MORE work in a pull request, especially given how badly I dropped the ball on reviewing this one.

Thanks again!

sritchie commented 9 years ago

This is pretty closely related to

https://github.com/nathanmarz/cascalog/issues/24

Both the chained-and-ignored predicate thing and duplicate operations happen a lot as a result of predicate macros.

dollschasingmen commented 9 years ago

yea - good point re: handling chains of operations. i think i got it to work using basically that fixed-point idea, see latest commit.

also agree about the subquery boundary and pred-macros -- sure would be nice to be able to prune into that as well :) but that I might prefer to leave to another PR.

dollschasingmen commented 9 years ago

hm not sure why CI failed, test passed locally

sritchie commented 9 years ago

Looks like a dependency download failure with cascalog-math. Thanks again!

slpsys commented 9 years ago

Sweet!

sritchie commented 9 years ago

Love that this works now:

cascalog.api> (defn throw! [& xs] (throw (RuntimeException. "gotcha!")))
#'cascalog.api/throw!
cascalog.api> (??<- [?x] ([[1]] ?x) (throw! ?x :> ?ignored))
([1])
cascalog.api> (??<- [?x ?ignored] ([[1]] ?x) (throw! ?x :> ?ignored))
FlowException local step failed  cascading.flow.planner.FlowStepJob.blockOnJob (FlowStepJob.java:219)