twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.48k stars 703 forks source link

Solve optimization of zipped executions #1911

Closed johnynek closed 5 years ago

johnynek commented 5 years ago

previously we could only merge two writes if they were directly zipped. If you zip with Execution.unit you ruin this behavior.

Secondly, we weren't applying execution optimizations after flatmap-like functions, which reduces the effectiveness if there are any flatMaps in your system.

We hit a bug that we really wanted to merge Executions but it was not happening. This should fix it for us.

@ttim @dieu can you all take a look?

dieu commented 5 years ago

looks good from my side.

cc @ttim

johnynek commented 5 years ago

let's not merge just yet, I have some tests I want to add, I think this still may not be enough.

I'd like to finally get this solved for good.

johnynek commented 5 years ago

@dieu @ttim please take another look.

I found a failing unit test with my previous approach, and I realized it was still playing whack a mole. So, I took a step back, and thought about how to totally solve the problem.

While much more complex, I believe the current PR totally solves the problem of merging WriteExecutions. I think we now have the property that (other than flatMaps) there is at most 1 WriteExecution in each Execution. Since we apply the optimizations after each flatMap application now, I think this is the best we can do.

I've tried to comment the code, but feel free to ask any questions.

oscar-stripe commented 5 years ago

I think this is good to go. Any comments @ttim @dieu ?

johnynek commented 5 years ago

Pinging @ttim you haven’t weighed in yet.

I think this is a big improvement which I hope can really improve plans for many users.

ttim commented 5 years ago

@johnynek I'm going through it right now. It would be nice to separate change where you apply execution optimizations of flatMaps & co and where actual zip optimization happens, but this is optional.

johnynek commented 5 years ago

Simplified after talking to @ttim -- I think @ttim may have an even simpler approach which may follow up with as a PR, but this seems like strictly better than our current situation.