Closed ttim closed 7 years ago
I'll look at this.
One question: is there a design doc I can look at for the planned changes.
One feature of storm we have never leveraged is that each bolt can have multiple outputs. This can be a big win for reducing serialization. I wonder if an improved design can take advantage of this.
Also, we have long wanted the ability to do key-grouped joins, and possibly merge join + mapValues + sumByKey into one node. This could reduce communtication to stores because caching might be more effective.
would love to see what is in scope for the changes rather than just polishing to the next persons satisfaction (having seen now 3 prior refactorings to suit different folks taste).
@johnynek We tested grouped join internally and it looks very promising, so I'm making this feature production ready right now.
Basically grouped join change splits into two different things to do:
(K, V)
pairs not only in case of summers but also for regular flat maps (right now SB sends either values
tuples or partially aggregated tuples),leftJoin
s.In order to implement (1) I did this refactoring to make code which builds topology clearer and simpler to fix. Also I made it more error prone, for example, it fails at construction time when you build topology defined in https://github.com/twitter/summingbird/issues/725.
After this PR I'm going to do one more PR which implements exactly (1), and I hope it will not be complicated anymore. And after I'm going to do one more (last) PR which implements (2) on top of this changes, and which implements grouped join (but without hot key handling for now).
/cc @benpence @pankajroark
I totally agree about shipping incrementally, groupedJoin first. We can make other optimizations after shipping groupedJoin.
Merging #731 into develop will increase coverage by
0.27%
. The diff coverage is92.52%
.
@@ Coverage Diff @@
## develop #731 +/- ##
==========================================
+ Coverage 71.13% 71.4% +0.27%
==========================================
Files 146 149 +3
Lines 3572 3627 +55
Branches 196 201 +5
==========================================
+ Hits 2541 2590 +49
- Misses 1031 1037 +6
Impacted Files | Coverage Δ | |
---|---|---|
...tter/summingbird/storm/builder/KeyValueSpout.scala | 78.94% <ø> (ø) |
|
...mmingbird/online/executor/OperationContainer.scala | 33.33% <ø> (ø) |
:arrow_up: |
...bird/storm/builder/AggregatorOutputCollector.scala | 90.9% <ø> (ø) |
|
...m/twitter/summingbird/storm/builder/EdgeType.scala | 95% <100%> (ø) |
|
...witter/summingbird/storm/FlatMapBoltProvider.scala | 100% <100%> (+1.44%) |
:arrow_up: |
...twitter/summingbird/storm/SummerBoltProvider.scala | 100% <100%> (ø) |
|
...m/twitter/summingbird/storm/builder/Topology.scala | 100% <100%> (ø) |
|
.../com/twitter/summingbird/storm/StormPlatform.scala | 64.28% <100%> (ø) |
:arrow_up: |
.../com/twitter/summingbird/storm/SpoutProvider.scala | 97.43% <100%> (+7.23%) |
:arrow_up: |
...itter/summingbird/storm/builder/EdgeGrouping.scala | 66.66% <66.66%> (ø) |
|
... and 8 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update fb7ce9c...70ac17a. Read the comment docs.
@johnynek What do you think about this change now? I specialized types where I was able to.
@pankajroark @johnynek I did two more changes on top of what I had before:
SummerBoltProvider
into separate class, to be consistent with Source
and FlatMap
logic https://github.com/twitter/summingbird/pull/731/commits/ad1b2f9b96782748b679fcacb4ec4fedd7edc25dI think this change is ready to land, what do you think?
@johnynek What I really like about this change is fact it's making planner exceptions clear based on type signatures here https://github.com/twitter/summingbird/pull/731/files#diff-ada7a53f54309669e9f2c41907787a5cR23. From this point it's clear that planner doesn't work in case of branches with flatMap
s & sumByKey
s. And with two subsequent sumByKey
s also.
I'm preparing PR which fixes all this errors as well as introduce proper (K, V)
handling (which enables us to do grouped join).
@johnynek Finally green build =) I guess it's safe to merge.
Interestingly one of storm tests (which do topology building for random producers) started to fail, but it's as expected (we just added more asserts) and I'm going to fix this in next PR.
@pankajroark @johnynek Thank you folks for reviews.
👍 On Tue, Jun 13, 2017 at 17:56 Timur notifications@github.com wrote:
Merged #731 https://github.com/twitter/summingbird/pull/731.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/twitter/summingbird/pull/731#event-1122557775, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJdqiWe5pJXgY7AysEnYP3QAUOA6Ovks5sD1m2gaJpZM4N0qpY .
This PR introduces
Topology
class which corresponds to Storm's dag with nodes and edges.We firstly build
Topology
instance out of plannedDag
and then buildPlannedTopology
out of it. At the moment of topology building we verify that edges in topology are compatible with each other and output format on outcoming edges is the same.