optimizations we could do with a full AST at plan time

johnynek commented 7 years ago

If we merge #1666 and follow it by putting Grouped, CoGrouped, and HashJoinable in the AST, we could do a number of optimizations fairly easily, especially if we steal the part of the summingbird graph optimizer code that does not actually depend on summingbird. Some examples:

[x] Push up mapValues/flatMapValues after a toTypedPipe on a group/cogroup to avoid cascading tuple boxing
[x] Push up filterKeys after a toTypedPipe on a group to before the shuffle.
[ ] Push a flatMapValues into the joiner on a hashJoin on the right so we don't expand it before we broadcast.
[x] Make sure to force a hashJoin to disk at a good spot (after a filter, but before a flatMap). We do this now, but in a pretty brittle way.
[ ] In map-reduce, to a good approximation, serialization is expensive and computation is not, so we can convert a flatMap to a filter + flatMap where the filter checks whether the flatMap produces 1 or more outputs. We can use this to insert filters before communication barriers to hopefully reduce the number of items communicated.
[x] Push up filterKeys to both sides of a hashJoin/cogroup/group (may be done now in all cases, but need to make sure we don't regress)
[ ] Remove cascading Merge nodes before GroupBy (helps with tez, simplifies the cascading graph).
[ ] distribute joins: a.join(b ++ c) == a.join(b) ++ a.join(c) and (a ++ b).join(c) == (a.join(c) ++ b.join(c)), so we should be able to do that in 1 map-reduce step.
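
To make two of the unchecked items concrete, here is a minimal sketch on plain Scala lists rather than scalding pipes. The names JoinIdentities, join, and filterThenFlatMap are made up for illustration and are not scalding APIs. It only shows that the rewrites preserve results (up to ordering), not how the planner would apply them.

```scala
object JoinIdentities {
  // Inner join on in-memory lists of key/value pairs.
  // A stand-in for a shuffle join, not scalding's implementation.
  def join[K, A, B](left: List[(K, A)], right: List[(K, B)]): List[(K, (A, B))] =
    left.flatMap { case (k, a) =>
      right.collect { case (k2, b) if k2 == k => (k, (a, b)) }
    }

  // flatMap rewritten as filter + flatMap: inputs that would produce
  // no output are dropped first, at the cost of calling fn twice.
  def filterThenFlatMap[A, B](in: List[A])(fn: A => List[B]): List[B] =
    in.filter(a => fn(a).nonEmpty).flatMap(fn)
}
```

The filter in filterThenFlatMap is the part that could be moved to before a communication barrier; the second call to fn is the extra computation traded for fewer serialized items.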

johnynek commented 6 years ago

pretty much a dup of #1736

johnynek commented 6 years ago

Another interesting rule is the following

val p1 = p.filter(fn1)
val p2 = p.filter(fn2)

If p1 and p2 have no other children and don't merge back into a single mapper (via de-diamonding), then we might want to go from p -> p.filter { x => fn1(x) || fn2(x) }, since function application is cheap. That way we make sure we don't checkpoint a giant data set just to filter it out downstream.
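
A minimal sketch of that rule on plain lists, assuming the original filters are still applied downstream of the merged one. MergedFilter and split are hypothetical names, not part of scalding.

```scala
object MergedFilter {
  // Apply the disjunction once upstream, then the individual
  // filters downstream of the (now smaller) shared data set.
  def split[A](p: List[A])(fn1: A => Boolean, fn2: A => Boolean): (List[A], List[A]) = {
    // This is the data set that would be checkpointed.
    val shared = p.filter(x => fn1(x) || fn2(x))
    (shared.filter(fn1), shared.filter(fn2))
  }
}
```

The results are the same as p.filter(fn1) and p.filter(fn2); only the size of the intermediate data changes.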

johnynek commented 6 years ago

Another rule: a.join(b).toTypedPipe.join(c) could be rewritten to a.join(b).join(c), which makes sure we stay in 1 map-reduce job.

also: a.join(b.join(c).toTypedPipe) == a.join(b.join(c))

This may be a big usability win so extra .toTypedPipe calls don't change the efficiency of jobs.
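
As a rough sketch, such a rule could look like the following on a toy AST. Plan, Source, Join, ToTypedPipe, and elide are all hypothetical and far simpler than scalding's real types; the point is only the shape of the rewrite.

```scala
object ToTypedPipeElision {
  // Hypothetical, heavily simplified plan AST (not scalding's types).
  sealed trait Plan
  final case class Source(name: String) extends Plan
  final case class Join(left: Plan, right: Plan) extends Plan
  final case class ToTypedPipe(input: Plan) extends Plan

  // Bottom-up rewrite: a ToTypedPipe that sits directly between
  // two joins is dropped; every other node is kept as-is.
  def elide(plan: Plan): Plan = plan match {
    case Join(l, r)      => Join(unwrapJoin(elide(l)), unwrapJoin(elide(r)))
    case ToTypedPipe(in) => ToTypedPipe(elide(in))
    case s: Source       => s
  }

  // Strip a ToTypedPipe only when it wraps a Join.
  private def unwrapJoin(p: Plan): Plan = p match {
    case ToTypedPipe(j: Join) => j
    case other                => other
  }
}
```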

johnynek commented 6 years ago

If you do a join of just two items and you are summing one of them, you can avoid ever materializing more than one thing into memory. Assume you are doing left.join(right.sumByKey). In this case, you could sort so the right side comes first, accumulate the iterator into the summed value while the items are from the right, and then, when you see the first left, take the sum and .map it onto the remaining iterator.

These kinds of joins are pretty common, so it might be worth it.
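
A minimal sketch of the per-key logic, assuming the values for one key arrive as an iterator already sorted with the right side first. The Either encoding, the Long value type, and the names SummedJoin and joinWithSum are assumptions for illustration, not scalding APIs.

```scala
object SummedJoin {
  // Values for a single key, sorted so every Right precedes any Left.
  // Right holds values to be summed; Left holds the other side of the join.
  def joinWithSum[L](sortedValues: Iterator[Either[L, Long]]): Iterator[(L, Long)] = {
    val buffered = sortedValues.buffered
    var sum = 0L
    var sawRight = false
    // Fold the leading run of Right values into one number.
    while (buffered.hasNext && buffered.head.isRight) {
      buffered.next() match {
        case Right(r) =>
          sum += r
          sawRight = true
        case Left(_) => () // not reached: head was checked to be a Right
      }
    }
    val total = sum
    // Inner join: no right values for this key means no output.
    if (!sawRight) Iterator.empty
    else buffered.collect { case Left(l) => (l, total) } // lazy, nothing buffered
  }
}
```

Only the running sum is held in memory; the left values stream through one at a time.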

twitter / scalding

optimizations we could do with a full AST at plan time #1669