I've been thinking that performance should be better: I'd expect it to be in the 10K RPS range rather than the 1K RPS range on a fast machine.
The way we read TCP sockets is slightly unusual. Currently, we read only one MySqlPacket at a time from the socket. When dealing with result sets, this is equivalent to a single row. Result set rows are usually fairly small, so we end up calling into the OS's socket API for every single row in a result set.
We may call socket.read a handful of times before we've consumed even one TCP packet's worth of data. That means the packet sits in the OS's socket buffer, taking up space, and the buffer can't accept more data until what's there has been read out.
When activity is high and many MySQL connections are reading a lot of data, the OS socket buffers sit full while connections drain them row by row. MySQL may try to send more data, but when a socket buffer is full, those packets get dropped and cause TCP retransmissions. TCP retransmissions are bad for performance.
To test this theory, I ran our stress tests and checked retransmissions after each test:
I wanted to see what other MySQL drivers out there could do, so I coded the GET /api/async endpoint in Go. It has far fewer retransmissions and is a lot faster:
The point of this comparison is not to say one language is better than another; it's to show that the MySQL server itself can serve 10K requests in 1 second. I believe async C# code should also be able to handle 10K requests in 1 second. Go uses M:N threading and async I/O, just like async C#.
I think that part of this performance improvement will come from making our library more "TCP friendly" and freeing up the OS socket buffer.