twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

optimizations we could do with a full AST at plan time #1669

Open johnynek opened 7 years ago

johnynek commented 7 years ago

If we merge #1666 and continue that with putting Grouped, CoGrouped, and HashJoinable in the AST, we could do a number of optimizations, fairly easily if we steal the summingbird graph optimizer code that does not actually depend on summingbird. Some examples:

johnynek commented 6 years ago

pretty much a dup of #1736

johnynek commented 6 years ago

Another interesting rule is the following

val p1 = p.filter(fn1)
val p2 = p.filter(fn2)

if p1 and p2 have no other children and don't merge back into a single mapper (via de-diamonding), then we might want to go from p -> p.filter { x => fn1(x) || fn(x) } since function application is cheap, then we can make sure we don't checkpoint a giant data set just to filter it out downstream.

johnynek commented 6 years ago

Another rule: a.join(b).toTypedPipe.join(c) could be a.join(b).join(c) and make sure we stay in 1 map-reduce job.

also: a.join(b.join(c).toTypedPipe) == a.join(b.join(c))

This may be a big usability win so extra .toTypedPipe calls don't change the efficiency of jobs.

johnynek commented 6 years ago

If you do a join of just two items and one of them you are summing, you can avoid every materializing more than one thing into memory. Assume you are doing: left.join(right.sumByKey) in this case, you could sort so the right side comes first, then accumulate the iterator while it is right into the summed value, then when you see the first left, take the sum and .map it into the remaining iterator.

These kinds of joins are pretty common, so it might be worth it.