twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

Introduce `Topology` class #731

Closed ttim closed 7 years ago

ttim commented 7 years ago

This PR introduces Topology class which corresponds to Storm's dag with nodes and edges.

We firstly build Topology instance out of planned Dag and then build PlannedTopology out of it. At the moment of topology building we verify that edges in topology are compatible with each other and output format on outcoming edges is the same.

johnynek commented 7 years ago

I'll look at this.

One question: is there a design doc I can look at for the planned changes.

One feature of storm we have never leveraged is that each bolt can have multiple outputs. This can be a big win for reducing serialization. I wonder if an improved design can take advantage of this.

Also, we have long wanted the ability to do key-grouped joins, and possibly merge join + mapValues + sumByKey into one node. This could reduce communtication to stores because caching might be more effective.

would love to see what is in scope for the changes rather than just polishing to the next persons satisfaction (having seen now 3 prior refactorings to suit different folks taste).

ttim commented 7 years ago

@johnynek We tested grouped join internally and it looks very promising, so I'm making this feature production ready right now.

Basically grouped join change splits into two different things to do:

  1. being able to send (K, V) pairs not only in case of summers but also for regular flat maps (right now SB sends either values tuples or partially aggregated tuples),
  2. implement change to planner algorithm which creates separate flat map node with proper grouping for leftJoins.

In order to implement (1) I did this refactoring to make code which builds topology clearer and simpler to fix. Also I made it more error prone, for example, it fails at construction time when you build topology defined in https://github.com/twitter/summingbird/issues/725.

After this PR I'm going to do one more PR which implements exactly (1), and I hope it will not be complicated anymore. And after I'm going to do one more (last) PR which implements (2) on top of this changes, and which implements grouped join (but without hot key handling for now).

/cc @benpence @pankajroark

pankajroark commented 7 years ago

I totally agree about shipping incrementally, groupedJoin first. We can make other optimizations after shipping groupedJoin.

codecov-io commented 7 years ago

Codecov Report

Merging #731 into develop will increase coverage by 0.27%. The diff coverage is 92.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           develop    #731      +/-   ##
==========================================
+ Coverage    71.13%   71.4%   +0.27%     
==========================================
  Files          146     149       +3     
  Lines         3572    3627      +55     
  Branches       196     201       +5     
==========================================
+ Hits          2541    2590      +49     
- Misses        1031    1037       +6
Impacted Files Coverage Δ
...tter/summingbird/storm/builder/KeyValueSpout.scala 78.94% <ø> (ø)
...mmingbird/online/executor/OperationContainer.scala 33.33% <ø> (ø) :arrow_up:
...bird/storm/builder/AggregatorOutputCollector.scala 90.9% <ø> (ø)
...m/twitter/summingbird/storm/builder/EdgeType.scala 95% <100%> (ø)
...witter/summingbird/storm/FlatMapBoltProvider.scala 100% <100%> (+1.44%) :arrow_up:
...twitter/summingbird/storm/SummerBoltProvider.scala 100% <100%> (ø)
...m/twitter/summingbird/storm/builder/Topology.scala 100% <100%> (ø)
.../com/twitter/summingbird/storm/StormPlatform.scala 64.28% <100%> (ø) :arrow_up:
.../com/twitter/summingbird/storm/SpoutProvider.scala 97.43% <100%> (+7.23%) :arrow_up:
...itter/summingbird/storm/builder/EdgeGrouping.scala 66.66% <66.66%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update fb7ce9c...70ac17a. Read the comment docs.

ttim commented 7 years ago

@johnynek What do you think about this change now? I specialized types where I was able to.

ttim commented 7 years ago

@pankajroark @johnynek I did two more changes on top of what I had before:

  1. extracted SummerBoltProvider into separate class, to be consistent with Source and FlatMap logic https://github.com/twitter/summingbird/pull/731/commits/ad1b2f9b96782748b679fcacb4ec4fedd7edc25d
  2. put all common Spout/Bolt types into separate place, I believe it increased readability a lot https://github.com/twitter/summingbird/pull/731/commits/8242812d3fe9e5b1b819bb64f8ee9aae3209271f

I think this change is ready to land, what do you think?

ttim commented 7 years ago

@johnynek What I really like about this change is fact it's making planner exceptions clear based on type signatures here https://github.com/twitter/summingbird/pull/731/files#diff-ada7a53f54309669e9f2c41907787a5cR23. From this point it's clear that planner doesn't work in case of branches with flatMaps & sumByKeys. And with two subsequent sumByKeys also.

I'm preparing PR which fixes all this errors as well as introduce proper (K, V) handling (which enables us to do grouped join).

ttim commented 7 years ago

@johnynek Finally green build =) I guess it's safe to merge.

Interestingly one of storm tests (which do topology building for random producers) started to fail, but it's as expected (we just added more asserts) and I'm going to fix this in next PR.

ttim commented 7 years ago

@pankajroark @johnynek Thank you folks for reviews.

johnynek commented 7 years ago

👍 On Tue, Jun 13, 2017 at 17:56 Timur notifications@github.com wrote:

Merged #731 https://github.com/twitter/summingbird/pull/731.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/twitter/summingbird/pull/731#event-1122557775, or mute the thread https://github.com/notifications/unsubscribe-auth/AAEJdqiWe5pJXgY7AysEnYP3QAUOA6Ovks5sD1m2gaJpZM4N0qpY .