twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

Adding crush down capability into the SourceNode #682

Closed NPraneeth closed 8 years ago

NPraneeth commented 8 years ago

Closes #679 This PR mainly adds the summing capability to the spout. When FMMergeableWithSource is opted in, the flatMap operation in the spout creates a fanOut. The fanOut is being crushed down by using the capability of AsyncSummers inside the spout and emitting only crushed tuples. The spout-side aggregation metrics are added to the spout for performance analysis. The messageIds are also aggregated along with the tuples and ack/fail back is sent to all the aggregated messageIds. Tuples from different streams are aggregated individually.

johnynek commented 8 years ago

I'd really like to see us improve the type-safety of this code. There is a lot of casting that the next person to work on this may not be able to keep straight in their head as to why they should expect them to work.

Ideally, we have no casting in the system. That is usually not attainable when dealing with an untyped library like storm, but still we can try to narrow where the casts happen as much as possible.

NPraneeth commented 8 years ago

@johnynek : Sure.. will modify the code to make it type safe. Will add comments at places justifying the casting.

johnynek commented 8 years ago

basically my comments are:

  1. only use var as a total last resort that is necessitated by benchmarked performance concerns. Otherwise the state of this system gets really hard for us to keep track of fast.
  2. minimize casting. I don't see too many places we can improve here due to storm, but always good to be very skeptical of a cast.
johnynek commented 8 years ago

just a couple of minor comments. Seems like we are getting close.

johnynek commented 8 years ago

👍 I'm happy to delegate the Async discussion to @pankajroark and @jnievelt . I don't think the risk is sky high in either case, and we can follow up quickly to improve as needed.

pankajroark commented 8 years ago

Looks good to ship from my point of view after taking care of the new comments that Joe added. Praneeth shared with me, offline, the results of running a production job with and without async caching. The numbers didn't show async caching version to be worse. I think we have consensus on the fact that blocking on results of Async caches is not a good idea, but at least in one production job it didn't show a clear negative effect, so I'm ok with shipping this as is

jnievelt commented 8 years ago

+1 LGTM