phaistos-networks / TANK

A very high performance distributed log service
Apache License 2.0

Problem with Client when receiving more than one consume message #66

Closed (krconv closed this issue 5 years ago)

krconv commented 5 years ago

When the client receives multiple consume messages during a single poll() period, some of the messages' contents returned by consumed() are corrupted/bad pointers. This can cause errors, such as a SEGFAULT, due to the corrupted char* data() pointer or size_t size() of a consumed_msg.

I've created this branch to reproduce the issue: https://github.com/krconv/TANK/tree/client-tests

git clone https://github.com/krconv/TANK.git
cd TANK
git checkout client-tests
docker build . -t tank-test
while [ $? -eq 0 ] ; do docker run -it tank-test ; done

My output:

Note: Google Test filter = ClientTest.CanConsumeSimultaneously
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from ClientTest
[ RUN      ] ClientTest.CanConsumeSimultaneously
static void client::test::ClientTest::createEmptyTankTopics()
static void client::test::ClientTest::startTank()
<=TANK=> v0.77 | 2 topics registered, 2 partitions; will listen for new connections at 127.0.0.1:11011
(C) Phaistos Networks, S.A. - http://phaistosnetworks.gr/. Licensed under the Apache License

tests/client-test.cpp:184: Failure
      Expected: 1
To be equal to: pMessage->seqNum
      Which is: 140039336233528
[segfault]
krconv commented 5 years ago

I believe I've traced the issue to this section of code in client.cpp:1976. In the corruption case, the messages pushed onto consumedPartitionContent end up getting overwritten when the next message is processed (i.e. when process_consumed is called again during the same poll()). I think a quick fix is to remove this optimization and always allocate new space for the message, but I don't know how that would impact performance.

if (i == topicsCnt - 1 && k == partitionsCnt - 1) {
        // optimization
        consumedPartitionContent.push_back({clientReqId,
                                            topicName,
                                            partitionId,
                                            {consumptionList.data() + consumption_list_base,
                                             static_cast<unsigned int>(consumed_cnt)},
                                            true,
                                            {next, lastPartialMsgMinFetchSize}});
} else {
        ...
}

I can provide the debugging information that I have if useful too.

markpapadakis commented 5 years ago

Thank you for the report. There is a new TANK server and client release currently under testing which fixes this and other problems reported by other users. It will be a few more days before it's ready for a public release, but I believe it's worth the wait: many new features and performance improvements.

markpapadakis commented 5 years ago

@krconv you can certainly disable that optimisation explicitly; it should work fine until the new release is pushed upstream.

krconv commented 5 years ago

Thanks for the help!