twitter / scalding

A Scala API for Cascading
http://twitter.com/scalding
Apache License 2.0
3.5k stars 706 forks source link

Add WritePartitioner to break large flows up #1800

Closed johnynek closed 6 years ago

johnynek commented 6 years ago

Here is some code that we want to use internally inside giant summingbird flows to break up a bunch of writes into a smaller set.

It probably isn't totally optimal yet, but it seems to do the trick of breaking things up.

Since we don't have any fancy recursion schemes going on, there is a lot of code duplication from a very similar, but not super trivial to share, function called toLiteral in the OptimizationRules.

@ianoc and @erik-stripe @non can you both give feedback?

johnynek commented 6 years ago

@ianoc I added a test that we only have 1 more step than cascading by doing this (the trailing map that we don't yet pull up). If we nailed that, we could turn this on by default, or add a config flat to automatically do this inside of executions.... for now I'll let us get more experience with it.

non commented 6 years ago

It's great to see this Materializer type class being used to write those tests!

Also great use of EqTypes (i.e. Leibniz).

👍

ianoc commented 6 years ago

👍

lgtm not in the critical path so any iterations to perf whatever can be done on develop i think. This looks to have good test coverage/very useful features.

(side interesting benefit is we could actually do some absolute crazyness and if the ordered serializers have been registered for a type use those at our materialization points now... since the signatures don't include them it would need to be some sort of i guess cache.... but be an interesting experiment)