salesforce / storm-dynamic-spout

A framework for building spouts for Apache Storm and a Kafka based spout for dynamically skipping messages to be processed later.
BSD 3-Clause "New" or "Revised" License
40 stars 13 forks source link

Add consumer id to the Record object #44

Closed stanlemon closed 6 years ago

stanlemon commented 6 years ago

Resolves #41

stanlemon commented 6 years ago

@Crim Interestingly when it comes out of the Vspout the MessageId has this info, we just don't have it further upstream. I kind of went back and forth on this, but the potential use case is when tracking offsets like we were doing in our debug scenario and you have sidelines going. It could give you the ability to introspect the source of the message a little more closely inside of the Deserializer. Maybe this is over kill though, I'm not sure. What do you think?

Also worth noting, this will break different Consumer's we've built. There are a number of 0.10 breaking changes so that's not the end of the world, just something to keep in mind.

Crim commented 6 years ago

I'm curious to know more about the use case here. I'm wondering if this encourages tightly coupling downstream logic in the topology to internals of the spout, which feels "Not Good" (tm)

On the other hand, I could see this being useful in the context of being able to emit the tuple down a different stream out of the spout based on the virtual spout it originated from, but allowing that to happen is a larger issue in itself.

stanlemon commented 6 years ago

@Crim regarding use case, I think the idea here is that when you deserialize in the consumer you could see hey, is this from a sideline or not? There's not really facilities inside of VirtualSpout to do that sort of thing. That said, the deserializer is Kafka-specific so maybe we're leaking concerns here, I'm not sure.

@danielsd do you have any additional thoughts on this one?

daniel-dara commented 6 years ago

It does seem like if you need to know the consumer, then maybe you shouldn't be using the dynamic spout, but that's not a choice when it comes to sidelining. I'm not sure we have a use case ourselves yet, but I think it's reasonable that a topology know which tuples came from a sideline spout so they can be processed differently if necessary.

stanlemon commented 6 years ago

Not 100% sure I agree. Part of the goal of sidelining is so that you don't have to worry what spout something originates from. The more I think about this the less I am sold that we should add this API surface area without a clear use case to back it up. Adding this parameter requires every implementing interface to pass this data. Not sure that's a necessary requirement just yet. If we can come up with a good use case for adding this I'm game, but I think for now I'm going to sideline (har har) this PR until we do.