Closed: gregschrock closed this issue 6 years ago
@gregschrock thank you for the great bug report.
I am going to look into it as soon as I can, although I am currently working on some major improvements to TANK, so it may take a while before the new branch is pushed upstream. Specifically, a few new optimizations that improve consumer request responsiveness for "tailing" semantics; also, TANK is going to become cluster-aware (similar to Kafka in terms of semantics, but hooking into Consul instead of ZooKeeper).
Hey @markpapadakis,
I spent some time investigating this issue today and found its source. This consumption optimization is problematic when `process_consume` is called multiple times within a single poll. The range created by `{consumptionList.data(), cnt}` will still point to the data of `consumptionList` when that list is reused in subsequent calls. That is, it's a pointer into the consumption list, which by then holds messages for a different partition. Ultimately, all partition content created through the optimization will share some number of messages; the exact number, and which messages, will vary based on how many messages each partition contains and the order in which the responses are processed.
My test passed when I disabled that optimization, but I'm not sure what the long term fix should be. The optimization is still valid for the last processed partition, but it seems impossible to know that the last partition (of all consume responses rather than just the current one) is being processed.
Let me know what you think the solution should be and I can put up a PR if you like.
Good morning @gregschrock :)
It should be fixed now; please update from the repo and try again.
It was actually a bit more subtle: `consumptionList` shouldn't have been `clear()`ed in that method context, and each (topic, partition) message set consumed was not appropriately tracked.
Thank you very much for your report. If everything's fine, please close this issue. The new TANK major release may take a while because I am tied up with other things, but it shouldn't take too long.
Thanks so much @markpapadakis!
The quick turnaround is appreciated! I applied your patch, and our related tests are passing.
However, I'm finding that `next.seqNum` has been impacted. In my test above, updating the expectations with:
```cpp
if (partitionContent.clientReqId == consumeTopic1Id)
{
    EXPECT_STREQ(topic1Message.c_str(), receivedMessage.c_str());
    EXPECT_EQ(2, partitionContent.next.seqNum); // <- added expectation
}
```
produces a failure where `partitionContent.next.seqNum` is 0 instead of 2.
Thank you @gregschrock. I'll look into it within the hour and will post an update.
I tried and couldn't reproduce that (thousands of runs). Any chance you didn't erase the topic/partition data before you ran the tests?
Okay, it must be something in my build. We don't have C++17 support, so I've needed to make minimal changes to for loops, etc., in order to be C++14 compatible. But the patch applied and built cleanly, so I expected everything to be good. I must have introduced this issue somehow. Sorry to take your time.
Okay, I've found the cause. I had pulled in "Fixes #64" but not "Fixed tiny client issue/type". The latter addressed the next-index computation issue from the first. All's good now.
Thank you very much :)
Hello,
My company has been using TANK to great success in a recent project. Thank you!
I recently uncovered a bug where consuming from different topics at the same time results in the same message being returned for both request IDs. I've created a (fairly) minimal test to demonstrate the issue.
I'm using the google testing framework, but the test can be run otherwise with very minimal changes. It does assume that TANK is running and that "topic1" and "topic2" exist with a single partition for each. I can update this snippet to create the topics through the client if that would be helpful.
The test fails at the snippet, with the response matching `consumeTopic1Id` having the message "I am from topic 2". Any insight you can give me into this behavior would be much appreciated.