johnynek opened 7 years ago
pretty much a dup of #1736
Another interesting rule is the following:

```scala
val p1 = p.filter(fn1)
val p2 = p.filter(fn2)
```

If `p1` and `p2` have no other children and don't merge back into a single mapper (via de-diamonding), then we might want to go from `p` to `p.filter { x => fn1(x) || fn2(x) }`. Since function application is cheap, this way we can make sure we don't checkpoint a giant data set just to filter it out downstream.
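Here is a minimal plain-Scala sketch of why the rewrite is safe (ordinary collections, not Scalding's `TypedPipe`; `p`, `fn1`, `fn2` are just placeholders): pre-filtering the shared parent with the disjunction of the two predicates keeps exactly the records either child needs, so each child still sees the same data while the checkpointed set can only shrink.

```scala
// Toy model with plain Scala collections, not Scalding's planner.
val p   = (1 to 100).toList
val fn1 = (x: Int) => x % 2 == 0
val fn2 = (x: Int) => x > 90

// What we would checkpoint instead of all of `p`:
val pre = p.filter { x => fn1(x) || fn2(x) }

// Each child re-applies its own predicate downstream and observes exactly
// the records it would have seen before the rewrite, because
// filter(fn1) after filter(fn1 || fn2) equals filter(fn1) alone.
assert(pre.filter(fn1) == p.filter(fn1))
assert(pre.filter(fn2) == p.filter(fn2))
```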
Another rule: `a.join(b).toTypedPipe.join(c)` could be `a.join(b).join(c)`, making sure we stay in one map-reduce job.

Also: `a.join(b.join(c).toTypedPipe) == a.join(b.join(c))`.

This may be a big usability win, since extra `.toTypedPipe` calls would no longer change the efficiency of jobs.
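The planner is free to do this because `.toTypedPipe` only changes the physical plan (it forces a materialization), never the data, so eliding it preserves results. A small in-memory model (plain Scala with hypothetical `join`/`toTypedPipe` helpers, not Scalding's API) illustrating the equivalence:

```scala
// Model keyed data as Map[K, List[V]] and join as an inner join per key.
def join[K, V, W](a: Map[K, List[V]], b: Map[K, List[W]]): Map[K, List[(V, W)]] =
  a.collect { case (k, vs) if b.contains(k) =>
    k -> (for { v <- vs; w <- b(k) } yield (v, w))
  }

// `toTypedPipe` is a semantic no-op: it only marks where a checkpoint happens.
def toTypedPipe[K, V](m: Map[K, List[V]]): Map[K, List[V]] = m

val a = Map("x" -> List(1, 2), "y" -> List(3))
val b = Map("x" -> List(10))
val c = Map("x" -> List(true))

// Same results with or without the intermediate round trip, so the optimizer
// can plan both forms as a single map-reduce step.
assert(join(toTypedPipe(join(a, b)), c) == join(join(a, b), c))
```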
If you do a join of just two items and one of them you are summing, you can avoid ever materializing more than one thing into memory. Assume you are doing `left.join(right.sumByKey)`. In this case, you could sort so the right side comes first, accumulate the iterator into the summed value while it is still on the right side, and then, when you see the first left value, take the sum and `.map` it into the remaining iterator.

These kinds of joins are pretty common, so it might be worth it.
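Here is a sketch of that streaming pattern in plain Scala for a single key's reducer input (the `joinWithSummedRight` helper and the `Either` encoding are hypothetical, not Scalding's actual join machinery): rights are folded into one running sum, then the remaining lefts are mapped lazily against it, so only the sum is held in memory.

```scala
// Per-key values, with all right-side values sorted before the left-side ones.
def joinWithSummedRight[L, R](
    values: Iterator[Either[L, R]]
)(plus: (R, R) => R): Iterator[(L, R)] = {
  val buffered = values.buffered
  var sum: Option[R] = None
  // Accumulate while we are still on the right side.
  while (buffered.hasNext && buffered.head.isRight) {
    buffered.next() match {
      case Right(r) => sum = Some(sum.fold(r)(plus(_, r)))
      case Left(_)  => () // unreachable: we just checked isRight
    }
  }
  sum match {
    case None    => Iterator.empty // inner join: no right side means no output
    case Some(s) => buffered.collect { case Left(l) => (l, s) }
  }
}

// e.g. the values one reducer sees for a key in left.join(right.sumByKey):
val out = joinWithSummedRight[String, Int](
  Iterator(Right(1), Right(2), Left("a"), Left("b"))
)(_ + _).toList
assert(out == List(("a", 3), ("b", 3)))
```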
If we merge #1666 and continue that with putting `Grouped`, `CoGrouped`, and `HashJoinable` in the AST, we could do a number of optimizations fairly easily if we steal the summingbird graph-optimizer code (which does not actually depend on summingbird). Some examples:

`a.join(b ++ c) == a.join(b) ++ a.join(c)` and `(a ++ b).join(c) == (a.join(c) ++ b.join(c))`, so we should be able to do that in one map-reduce step.
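A quick plain-Scala check of that algebra (pipes modeled as lists of key-value pairs so that `++` keeps duplicates; the `join` helper is hypothetical, not Scalding code). Join distributes over merge on either side, which is what lets the planner run both forms as a single map-reduce step over the merged input:

```scala
// Inner join on key over multiset-like keyed data.
def join[K, V, W](a: List[(K, V)], b: List[(K, W)]): List[(K, (V, W))] =
  for { (k, v) <- a; (k2, w) <- b; if k == k2 } yield (k, (v, w))

val a = List(1 -> "a1", 2 -> "a2")
val b = List(1 -> "b1", 3 -> "b3")
val c = List(1 -> "c1", 2 -> "c2")

// Equal as multisets, so compare after sorting.
assert(join(a, b ++ c).sorted == (join(a, b) ++ join(a, c)).sorted)
assert(join(a ++ b, c).sorted == (join(a, c) ++ join(b, c)).sorted)
```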