twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.48k stars 703 forks source link

Add a full de-diamonding optimization #1906

Closed johnynek closed 5 years ago

johnynek commented 5 years ago

one pattern that we see commonly on a system we have internally are graphs that look like:

val middle: TypedPipe[A] = ...
val out = (middle.map(f1) ++ middle.flatMap(f2) ++ .... ) // merging several transformations of middle

Currently planners will see this as a fork of middle, each are mapped, then merged back together.

But these map-only operations after the fork until the merge could be done in a single flatMap operation:

val optOut = middle.flatMap(mergeFn)

That is exactly what this current optimization rule does.

cc @stephbian @ianoc

johnynek commented 5 years ago

@ttim can you take a look at this?

johnynek commented 5 years ago

@dieu on our internal test, this saved 30% CPU time and shuffle bytes. It was a job that clearly had two examples of this needless diamond, which motivated this, but you all may have a significant number of jobs at Twitter that will get a win by this optimization.

dieu commented 5 years ago

@johnynek wow, that's awesome, but we need to encourage @ttim to another release :)