twitter / summingbird

Streaming MapReduce with Scalding and Storm
https://twitter.com/summingbird
Apache License 2.0
2.14k stars 267 forks source link

AggregatorOutputCollector emits too many message IDs #692

Closed jnievelt closed 7 years ago

jnievelt commented 7 years ago

The current logic is something like:

Of course, if we have two colliding keys being summed, and an AsyncSummer implementation that doesn't do all-or-nothing flushes on addAll/etc., an emit of one key would include the message IDs for the aggregations on the other key as well.

I think I missed this in review, writing it off as okay because we are doing synchronous handling on the addAll result. But actually we still need to track the message IDs along with the aggregated values, a la FinalFlatMap and Summer.

I have the fix ready, originally thinking the change was only needed to do Future queuing, but I can split it out since they're really independent.

pankajroark commented 7 years ago

Good catch. IIUC this can happen if one message id can result in multiple keys, is that the case you're referring to?

jnievelt commented 7 years ago

I think it just needs to be a partial flush on the AsyncSummer. I forget offhand if any of the ones in algebird currently do this, but I should be able to start the PR with a test that fails because of this.

johnynek commented 7 years ago

One thing to keep in mind are diamonds: due to caching and map-side combining, one output may correspond to more than one input. So, when that output is successful or fails, we need to fail/succeed all the tuple IDs to that point.

Then intent is to track that using standard algebird approaches and just merge the sequence of ids.

sriramkrishnan commented 7 years ago

CC @ttim