twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.49k stars 703 forks source link

Add ComposeReduceStep optimization #1826

Closed johnynek closed 6 years ago

johnynek commented 6 years ago

related to #1669

a.join(b.sumByKey.toTypedPipe) should be converted to 1 map-reduce job with this optimization.

We really need more tests to show not only that the optimizations don't change the results (which is what we have now) but that they actually work in cases we think they do (and that they aren't thwarted somehow by aspects of the graph we don't notice now.

A related optimization (but not this one) allows a.join(b).toTypedPipe.join(c) to be a single map-reduce job. That is a follow up to the current optimization.

johnynek commented 6 years ago

Here is a new rule, not yet turned on by default, but I think we should definitely consider it.

This is one of the more powerful optimizations we can do because it reduces the number of map/reduce steps and serialization boundaries. This optimization is kind of the holy-grail of the whole optimization work. This is exactly the kind of optimization we couldn't do before, so users had to micro manage when they wrote .toTypedPipe or not.

With this change, if enabled by default, we will probably see many jobs reduce the number of MR-steps, and also the ergonomics are very nice since users can concern themselves more with the logic and less with details of the map-reduce model to get good performance.

please take a look @ianoc @fwbrasil

johnynek commented 6 years ago

By the way: keep in mind, I don't think users should write .toTypedPipe explicitly ever, but sometimes methods return TypedPipe[(K, V)] when it is really a joined or grouped thing. With the current optimization subsequent .group.sum could be done without an additional map-reduce step.

ianoc commented 6 years ago

I wonder has the main usage i've seen of toTypedPipe gone away with the changes you've made around the pair implicit class... i think it was all about making intellij happy with the type.

That said this is great, users returning complex grouping types from functions so that they didn't hit extra steps if they combined them was rather unfortunate leak of internal scalding types. With this i think users in those cases can just always return the typed pipe. if things can stay in a single step they will... which is super neet for cleaner/more performant code

(lgtm)

johnynek commented 6 years ago

Thanks for the review @ianoc, merging with develop to make sure the tests are still green then will merge.

fwbrasil commented 6 years ago

I'm reviewing right now

johnynek commented 6 years ago

We are green.

Note we have not seen the MapGroup serialization failure of #1795 since the fix of #1827 -- so, that is looking more and more good with each passing CI.

fwbrasil commented 6 years ago

okay, I'm stopping the review

johnynek commented 6 years ago

Sorry I missed that you were reviewing @fwbrasil if you have any comments or changes, let me know, I can send a follow up PR.