twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

Configurable elimination of FlatMapNode by pushing the flatMap logic into the SourceNode. #675

Closed NPraneeth closed 8 years ago

NPraneeth commented 8 years ago

Currently we have the FlatMapNode created for a simple flatMap operation. The intent is to let the spout handle the flatMap operation just like the optionMap operation.

We are considering SourceNode aggregation out of scope for this issue, so topologies without any FlatMapNodes will do all of their aggregation in the SummerNode.

johnynek commented 8 years ago

I like this project, but it might be a non-negligible change to the design.

We could imagine making all tuples (K, V) pairs but doing a shuffle grouping between mapping nodes. This means you don't have two different input/output formats. In the case of no key, we can put () which is serialized in 0 bytes by chill (but still takes a little space in a tuple by making a longer List to wrap.

That might make things easier since there is not this odd-ball node that has to prepare a special wire format in front of the summer.

jnievelt commented 8 years ago

FWIW, the complexity will largely mimic what's in FlatMapBoltProvider today:

https://github.com/twitter/summingbird/blob/v0.11.0-RC1/summingbird-storm/src/main/scala/com/twitter/summingbird/storm/FlatMapBoltProvider.scala

Shimming everything into (K, V) is an interesting idea, but it does have the complexity of knowing when to add/discard the (), and possibly handling the (contrived?) case where V actually is Unit. So maybe its benefit isn't that much overall?

NPraneeth commented 8 years ago

@johnynek : I have made all the changes to the PR considering all the comments. Can you take a look ?