pingcap / tiproxy

Apache License 2.0

Performance gap between TiProxy and HAProxy grows as the dataset size grows #381

Closed djshow832 closed 11 months ago

djshow832 commented 11 months ago

Problem

When running sysbench, the performance gap between TiProxy and HAProxy grows as TiDB returns more data.

Test Result

  1. Create a TiDB cluster with HAProxy and TiProxy, each of which has 2 CPU cores.
  2. Run sysbench with `--tables=10 --table-size=1000000 --threads=32 oltp_read_only --skip_trx=true --point_selects=0 --sum_ranges=0 --order_ranges=0 --distinct_ranges=0 --simple_ranges=1 --range_size={range_size}`
  3. Check the QPS and CPU of TiProxy and HAProxy.
| Range size | QPS | HAProxy CPU | QPS per 100% CPU |
| --- | --- | --- | --- |
| 10 | 32480 | 120% | 27100 |
| 100 | 27462 | 140% | 19600 |
| 1000 | 7420 | 110% | 6740 |
| 10000 | 757 | 90% | 840 |

| Range size | QPS | TiProxy CPU | QPS per 100% CPU |
| --- | --- | --- | --- |
| 10 | 30955 | 180% | 17200 |
| 100 | 14655 | 190% | 7710 |
| 1000 | 2112 | 200% | 1060 |
| 10000 | 221 | 200% | 110 |

As the tables show, when the range size is 10, HAProxy's QPS per 100% CPU is less than twice TiProxy's. But when the range size is 10000, it is almost 8 times TiProxy's.

Reason

In the MySQL protocol, each row of a result set is wrapped in its own MySQL packet, and TiProxy reads and writes the stream packet by packet, while HAProxy forwards raw TCP bytes without parsing them.

Thus, TiProxy is more impacted by the row count.

Code Analysis

The flame graph when the range size is 1000: (image omitted)

`WritePacket` and `ReadPacket` become the hot path, so this code path should be optimized.

xhebox commented 11 months ago
  1. I believe this could be solved by processing the full MySQL packet (header included), instead of stripping the header and handling only the body.

1,3. Actor models like gnet typically have one global buffer per connection.

Maybe we should also check the results of tracing.

djshow832 commented 11 months ago
> 1. I believe this could be solved by processing the full MySQL packet (header included), instead of stripping the header and handling only the body.

How?

> 1,3. Actor models like gnet typically have one global buffer per connection.
>
> Maybe we should also check the results of tracing.

I checked it before, but it's not easy to fix now.

xhebox commented 11 months ago

> How?

Most packets are just forwarded without processing. The packets that are processed are special cases: for example, they never need more than one MySQL packet to represent, or they belong to the handshake process, etc.

That said, we could just forward the original MySQL packets as-is most of the time, instead of parsing them into higher-level packets.

djshow832 commented 11 months ago

> Most packets are just forwarded without processing. The packets that are processed are special cases: for example, they never need more than one MySQL packet to represent, or they belong to the handshake process, etc.
>
> That said, we could just forward the original MySQL packets as-is most of the time, instead of parsing them into higher-level packets.

But we still need to parse the leading bytes of each packet to know whether there are more packets to come. Once the data is read packet by packet, there is little room to optimize.

xhebox commented 11 months ago

> > That said, we could just forward the original MySQL packets as-is most of the time, instead of parsing them into higher-level packets.
>
> But we still need to parse the leading bytes of each packet to know whether there are more packets to come. Once the data is read packet by packet, there is little room to optimize.

Yes, so buffering is needed. We only peek the header and then write out the whole buffer at once. That is, however, easier to implement in actor models than in the current model.