twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 708 forks source link

Add summingbird graph library as a zero-dep module #1711

Closed johnynek closed 7 years ago

johnynek commented 7 years ago

This moves the summingbird graph code into a zero dependency module in scalding.

This will allow us to later delete it from summingbird and depend on it here if we choose. Also it allows us to use the same graph code to do transformations and optimizations of the TypedPipe graphs.

After this is merged, my next step is to use this to make the new cascading backend better (look for fan-out nodes that the user did not use fork on, so that we avoid recomputation. We can also use this code to apply some optimization rules (e.g. pushing up filterKeys on TypedPipe into the previous step, which we can't currently do).

piyushnarang commented 7 years ago

👍 looks good to me.

johnynek commented 7 years ago

about the question @ttim posed: why do we have Rule on N rather than Literal. The reason is that it is much easier to write rules in terms of the underlying nodes.

See the summingbird rules: https://github.com/twitter/summingbird/blob/develop/summingbird-core/src/main/scala/com/twitter/summingbird/planner/DagOptimizer.scala#L354

There are pretty easy to write and read on the original nodes, but on Literal they would be almost impossible since you need to match on the functions that are embedded in the Unary or Binary operations.

ttim commented 7 years ago

@johnynek I'm not strongly against, I see a point to merge it as is and do some follow-ups.

About Rules on N[_] - I'm totally agree it's simpler, what I don't understand (and that's where my mental model breaks) - if you expressed change of dag structure on the level of N[_] how this can be reflected on dag itself?