sociomantic-tsunami / dlsproto

Distributed Log Store protocol definition, client, fake node, and tests
Boost Software License 1.0
3 stars 18 forks source link

DoubleBuffering implementation in GetRange sometimes hangs #62

Closed nemanja-boric-sociomantic closed 6 years ago

nemanja-boric-sociomantic commented 6 years ago

It looks like that #60 introduces a bug where on rare occasions not all connections are able to finish the GetRange request. This doesn't seem trivially reproducible, but it requires a lot of data.

nemanja-boric-sociomantic commented 6 years ago

OK, I've found a clue. It always happens when the client receives 0 records from the node (for some reason):

2018-02-22 11:54:18,682 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Wait for Records loop entered
2018-02-22 11:54:18,682 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: I'll ask for the next batch       
2018-02-22 11:54:18,683 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Asked for the next batch          
2018-02-22 11:54:18,684 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Processed data, setting length to zero
2018-02-22 11:54:18,684 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: this.stopped = false, this.buffers.empty() = true
2018-02-22 11:54:18,684 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: inside wait for records!          
2018-02-22 11:54:18,684 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: I'll suspend, the buffers are empty
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Reader: got records!
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Adding records: 100045            
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Let's swap the buffer.            
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: this.stopped = false, this.buffers.empty() = false ret = true
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Wait for Records loop entered     
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: I'll ask for the next batch
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Asked for the next batch          
2018-02-22 11:54:19,540 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Reader: got records!
2018-02-22 11:54:19,541 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Adding records: 0
2018-02-22 11:54:19,542 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: Processed data, setting length to zero
2018-02-22 11:54:19,542 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: this.stopped = false, this.buffers.empty() = true
2018-02-22 11:54:19,542 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: inside wait for records!          
2018-02-22 11:54:19,542 Info [dlsproto.client.request.internal.GetRange] - xxx: #302: I'll suspend, the buffers are empty

and this is where the request connection 302 hangs.

Same happens with the two connections in a different run:

2018-02-21 16:45:55,952 Info [dlsproto.client.request.internal.GetRange] - xxx: Reader: got records!
2018-02-21 16:45:55,952 Info [dlsproto.client.request.internal.GetRange] - xxx: Adding records: 0
2018-02-21 16:45:55,953 Info [dlsproto.client.request.internal.GetRange] - xxx: Processed data, setting length to zero
2018-02-21 16:45:55,953 Info [dlsproto.client.request.internal.GetRange] - xxx: this.stopped = false, this.buffers.empty() = true
2018-02-21 16:45:55,953 Info [dlsproto.client.request.internal.GetRange] - xxx: inside wait for records!
2018-02-21 16:45:55,953 Info [dlsproto.client.request.internal.GetRange] - xxx: I'll suspend, the buffers are empty
2018-02-21 16:45:54,736 Info [dlsproto.client.request.internal.GetRange] - xxx: Reader: got records!
2018-02-21 16:45:54,736 Info [dlsproto.client.request.internal.GetRange] - xxx: Adding records: 0
2018-02-21 16:45:54,739 Info [dlsproto.client.request.internal.GetRange] - xxx: Processed data, setting length to zero
2018-02-21 16:45:54,739 Info [dlsproto.client.request.internal.GetRange] - xxx: this.stopped = false, this.buffers.empty() = true
2018-02-21 16:45:54,739 Info [dlsproto.client.request.internal.GetRange] - xxx: inside wait for records!
2018-02-21 16:45:54,739 Info [dlsproto.client.request.internal.GetRange] - xxx: I'll suspend, the buffers are empty
nemanja-boric-sociomantic commented 6 years ago

OK, I've found the problem. The way how the RecordStream figures out that the next batch is received, is by looking into the number of the records in the batch. However, the node is completely free to send the batch of 0 records (for the protocol symmetry). So when the node sends the batch of 0 records, the number of the records will still be 0, although there's a batch to process. Of course, nothing would be processed, but the client would ack the receiving of the batch via Continue code.

So, the clients receives the empty batch and waits for another (without asking the node with Continue) and node expects that Continue to send the final Finished message.

nemanja-boric-sociomantic commented 6 years ago

Fixed in #64