trilogy-libraries / trilogy

Trilogy is a client library for MySQL-compatible database servers, designed for performance, flexibility, and ease of embedding.
MIT License
697 stars 68 forks source link

Fix "Got packets our of order" errors on connect #184

Closed skipkayhil closed 4 months ago

skipkayhil commented 4 months ago

When trying to upgrade to Trilogy 2.8.0, my application started raising errors when establishing connections to our database:

1156: Got packets out of order

Bisecting the changes between 2.7.0 and 2.8.0 indicated that the change to remove sequence number hardcoding introduced the bug.

Unfortunately, the issue was not easily reproducible locally, so I added a printf statement to try to understand what sequence number was being used instead of the previously hardcoded 1 (or 2). This log showed that in failure cases the sequence number was 0.

After that, adding more printfs helped provide a clearer picture of the sequence of events when the incorrect sequence number is used:

trilogy=init seq=0
trilogy-pre-try_connect seq=0
trilogy-connect_recv seq_was=0
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=0
trilogy-connect_recv seq_was=0
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=1
trilogy-connect_recv-seq_was=1
trilogy-connect_recv seq=0
trilogy-connect_recv-read_packet seq=0
trilogy-connect_recv-parse_handshake seq=0
trilogy=auth_send use_ssl=0, seq=0

My understanding of what happens here is that trilogy_connect_recv is getting called 3 times in the while loop inside try_connect.

The first time, it sets the packet_parser's sequence_number to 0 before it calls read_packet but only reads some amount up to the packet's sequence_number before exiting with TRILOGY_AGAIN.

trilogy_connect_recv gets called a second time, and again it sets the packet_parser's sequence_number to 0. This time read_packet consumes the packet's sequence_number and so packet_parser's sequence_number get incremented. However, this iteration also exits with TRILOGY_AGAIN before the packet is finished.

trilogy_connect_recv is then called a third time, and again it sets the packet_parser's sequence_number to 0. read_packet finally finishes consuming the packet and this iteration is finally able to parse the handshake from the completed packet.

Since the sequence_number is being reset on each loop, auth_send ends up incorrectly using a sequence number of 0 instead of 1 (or 2). Therefore it seems incorrect for each call to trilogy_connect_recv to reset the sequence_number since it is not guaranteed that the last call will be the one that consumes the handshake's sequence_numer and increments the value for use in auth_send.

skipkayhil commented 4 months ago

I'm not sure how best to test this πŸ˜…

My thinking would be a test could open a connection to a socket, write part of the handshake packet (past the seq number) to the socket, call trilogy_connect_recv, finish writing the handshake packet, call trilogy_connect_recv again, and then assert the seq number ends up as 1. However I'm not sure exactly how I would do that in the C test suite or if that's a reasonable test to write.

matthewd commented 4 months ago

I can't really see a way to end up in the situation where we'd re-call and end up anywhere near here while needing the reset (i.e. not on a fresh connection that's already zeroed).

But in principle, I think what makes the connect flow unique is the fact that its 'send' function doesn't actually send anything: I haven't checked the docs, but I guess the protocol says the server talks first.

So, every other packet type, prior to 'recv', gets its sequence reset by its builder, in its 'send', but this one does not.

Thus [even though I can't see any reason it wouldn't be safe to just delete it], I claim it would be most consistent for us to instead move this line up a bit, into trilogy_connect_send_socket: that's the moment that we send a zero-length sequence, skipping the builder involvement that would otherwise handle it.

skipkayhil commented 4 months ago

That makes sense to me πŸ‘ Happy to squash as well if you'd prefer

Edit: whoops, typo'd the field 🀦

eileencodes commented 4 months ago

Thanks @skipkayhil!