openzipkin-attic / zipkin-azure

Reporters and collectors for use in Azure
Apache License 2.0
11 stars 9 forks source link

Possible bug with adding up message counts? #16

Open aliostad opened 7 years ago

aliostad commented 7 years ago

This code looks like a bug to me but not sure the intended purpose hence asking:

countSinceCheckpoint += buffer.size();

https://github.com/openzipkin/zipkin-azure/blob/master/collector/eventhub/src/main/java/zipkin/collector/eventhub/ZipkinEventProcessor.java#L70

So let's say we have 5 spans in first message and 6 in the second, my countSinceCheckpoint will be 16 (5+11) rather than 11. I think just comparing with buffer.size() should be enough.

codefromthecrypt commented 7 years ago

looks sensible. want to add a test and fix?

https://github.com/openzipkin/zipkin-azure/blob/master/collector/eventhub/src/test/java/zipkin/collector/eventhub/ZipkinEventProcessorTest.java#L72

codefromthecrypt commented 7 years ago

ps I have work in progress on this. The SDK is a bit of a bear to work with as it is signed, etc, which makes it hard to mock certain things.

codefromthecrypt commented 7 years ago

https://github.com/openzipkin/zipkin-azure/pull/17

codefromthecrypt commented 7 years ago

cutting as 0.1.2 since I presume not checkpointing properly is a fairly major bug!

@aliostad thanks for discovering this

aliostad commented 7 years ago

I am sorry I have to open this issue again, since the problem is closely related to the same code.

It is not really our fault (I believe MS samples are wrong too) but basically the host creates a single IEventProcessor (in our case ZipkinEventProcessor) but the onEvents gets called per each partition, on different threads. Hence keeping a single count for ZipkinEventProcessor does not cut it.

Let me raise this with Microsoft before making any changes but my understanding is to keep a thread-safe dictionary of counts to solve.

codefromthecrypt commented 7 years ago

k sounds sensible, albeit possibly tricky.

if it ends up that way, we'd need to know if partitions come and go a lot. I thought I saw something about only 4 partitions in some thread someplace. Anything that helps fix the keys can help us avoid a memory leak when mapping on partition.

If you can, please ping MS to make a test version of their service as this sort of issue is a good example of where flying blind can hurt.

anyway, thanks for raising this and keep us posted!