twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0

How does Summingbird optimize the topology submitted to Storm? #585

Open kandu009 opened 9 years ago

kandu009 commented 9 years ago

Hi @johnynek ,

In one of the Summingbird talks, I saw that you mentioned an optimization of the topology submitted by Summingbird that is not possible using Storm alone.

Video: http://goo.gl/bqai75

Could you please point me to some more information about this? A link to the part of the GitHub repository responsible for it would be ideal.

Thanks!

johnynek commented 9 years ago

@ianoc can probably summarize this more quickly.

What goes on here are two main things:

1) Minimize the number of bolts needed for the topology, based on the logical Producer graph. This minimizes serialization costs, which are substantial in many cases. This is the planner logic; it can run because we know how to combine the Summingbird operations (they are not black boxes the way Storm bolts are). There is a small sketch of the idea after this list.

2) We implement a batching layer above Storm, since we have seen performance issues when emitting too many individual items into the output collectors. Instead, we emit Lists of data that are all being sent to the same node. To Storm, this looks like one large tuple, so it puts less stress on some of Storm's queueing systems.
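
To illustrate point 1, here is a minimal, self-contained sketch (not the actual Summingbird planner code; all names below are made up) of why fusing adjacent logical operations reduces bolt count: two adjacent map stages collapse into one function, so the plan needs a single bolt and avoids an extra serialize/ship/deserialize hop between them.

```scala
object FusionSketch {
  // Hypothetical logical nodes; Summingbird's real Producer graph is much richer.
  sealed trait Node[A, B] { def run(a: A): B }
  final case class MapNode[A, B](f: A => B) extends Node[A, B] {
    def run(a: A): B = f(a)
  }

  // Because map stages are not black boxes, adjacent ones can be fused into one:
  def fuse[A, B, C](m1: MapNode[A, B], m2: MapNode[B, C]): MapNode[A, C] =
    MapNode(m1.f andThen m2.f)

  def main(args: Array[String]): Unit = {
    val parse   = MapNode[String, Int](_.trim.toInt)
    val double  = MapNode[Int, Int](_ * 2)
    val oneBolt = fuse(parse, double) // planned as one stage instead of two
    println(oneBolt.run(" 21 "))      // prints 42
  }
}
```

The idea is that the fused function can become the body of a single Storm bolt, so the intermediate values never have to be serialized and shipped between bolts.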

kandu009 commented 9 years ago

@johnynek Thanks for the details. Could you please point me to the place where 2) (implementing a batching layer on Storm) is done?

Thanks!

johnynek commented 9 years ago

I'll be really impressed if you follow this code, but here is where this happens:

https://github.com/twitter/summingbird/blob/develop/summingbird-online/src/main/scala/com/twitter/summingbird/online/executor/FinalFlatMap.scala#L57

Note that we are emitting a Storm Values with 2 elements, an Int and a Map[K, V]. The Map represents the aggregated keys and values for this mini-batch. The Int is an identifier derived from the hash codes of all the K in that batch.
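
A rough, self-contained sketch of that idea (assumed names and plain Scala collections, not the actual FinalFlatMap code): the mini-batch's key/value pairs are summed into a single Map[K, V], and an Int derived from the keys' hash codes accompanies it so the whole pre-aggregated batch can travel as one tuple.

```scala
object BatchEmitSketch {
  // Sum a mini-batch of (key, value) pairs into one Map, combining values that
  // share a key (plus stands in for the Semigroup used in the real code).
  def sumBatch[K, V](batch: Seq[(K, V)])(plus: (V, V) => V): Map[K, V] =
    batch.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).fold(v)(plus(_, v)))
    }

  // An Int identifier derived from the hash codes of the batch's keys,
  // analogous to the first element of the emitted Values.
  def batchId[K](keys: Iterable[K]): Int = keys.map(_.hashCode).sum

  def main(args: Array[String]): Unit = {
    val miniBatch = Seq("a" -> 1, "b" -> 2, "a" -> 3)
    val summed = sumBatch(miniBatch)(_ + _) // Map(a -> 4, b -> 2)
    val id     = batchId(summed.keys)
    // The real code would wrap these in a storm Values(id, summed) and emit once,
    // rather than emitting each (key, value) pair as its own tuple.
    println((id, summed))
  }
}
```

Emitting once per mini-batch rather than once per pair is what makes this look like a single large tuple to Storm.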

The unfortunate fact is that this code was optimized against huge topologies that we had to keep operational as traffic grew, so there are some rather convoluted performance hacks glued in under the covers.

The Storm planning code is overdue for a refactoring, but it is not a very high priority for us at the moment, since it is working well internally at very high throughput.