snabbco / snabb

Snabb: Simple and fast packet networking
Apache License 2.0

ConnectX: Review 1xSQ transmit performance #1004

Open lukego opened 8 years ago

lukego commented 8 years ago

I have prepared an initial benchmark report on single-queue transmit performance with the ConnectX-4 100G NIC. The packet rates are quite modest. This may indicate that applications need to use many queues to make effective use of ConnectX-4.

[attached image: benchmark report graphs]

I would like to validate these results before drawing firm conclusions about single-queue performance.

plajjan commented 8 years ago

I would have hoped that it would do considerably better, assuming that your app does "nothing" else besides sending packets. With some more packet processing I guess we'll see much lower performance. What does the receive performance look like? On par?

plajjan commented 8 years ago

Should the graph only show 8192 queue depth?

lukego commented 8 years ago

Should the graph only show 8192 queue depth?

The graphs for all queue sizes are being plotted. The thing is that the queue size has very little impact and so sometimes the lines are drawn directly on top of each other. You can only see the difference in the first graph which has a "zoomed in" Y-axis. (I should have made a note of this in the writeup.)

I would have hoped that it would do considerably better, assuming that your app does "nothing" else besides sending packets.

Braindump:

My working hypothesis is that the CPU is mostly idle and waiting for the NIC to process the transmit requests. I think this Snabb code can send packets at 100+ Mpps per core (hand-tuned load generator) but the NIC can only process around 9M requests on the Send Queue per second. I suspect that the ConnectX-4 is a "scale out" rather than "scale up" NIC i.e. that it processes queues slowly but can process many queues in parallel. (That would be a bit like Xeon CPUs where the high-end ones have lower per-core clock speeds but make up for that by having more cores.)
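A back-of-the-envelope sketch of this scale-out hypothesis, for concreteness. The 9 Mpps per-queue figure is the observed single-queue estimate from above; the 148.8 Mpps ceiling is the standard 100GbE line rate for minimum-size (64-byte) Ethernet frames:

```python
# Back-of-the-envelope model of the "scale out" hypothesis:
# each Send Queue is assumed to process ~9M transmit requests per
# second, and aggregate throughput is capped by 100G line rate.

PER_QUEUE_MPPS = 9.0    # observed single-queue estimate (assumption)
LINE_RATE_MPPS = 148.8  # 100GbE line rate for 64-byte frames

def predicted_mpps(num_queues):
    """Predicted aggregate packet rate if queues scale linearly."""
    return min(num_queues * PER_QUEUE_MPPS, LINE_RATE_MPPS)

for n in (1, 4, 8, 16, 24):
    print(n, predicted_mpps(n))
```

If the hypothesis holds, the rate should climb roughly linearly with queue count (9, 36, 72, 144 Mpps, ...) until it hits the line-rate ceiling; if it plateaus early, something other than per-queue processing is the bottleneck.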

I want to test this theory by transmitting on many Send Queues at once. The prediction is that this will boost performance and reveal the maximum practical packet rate. I need to finish a rework of the transmit code before I can run this test.

If this does turn out to be the case then the situation will be a little different than we are used to. The Intel 82599 can handle line-rate on a single queue and only uses multi-queue to split up the traffic across CPU resources. This would be the opposite situation where we need to split up the traffic across NIC hardware resources. This would probably be fine for most applications but a PITA for some.

Could alternatively be that the benchmark is CPU-bound due to a bug in my load generator (it does do a little more work than the 82599 version that is tested at 200+ Mpps per core) or that there is a setting missing somewhere (e.g. on 82599 there was an obscure default register setting that needed to be changed to achieve > 12 Mpps per queue, see #628).
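One quick sanity check on the CPU-bound alternative is a cycle budget: at a given packet rate, how many CPU cycles does the load generator get per packet? A sketch, with an illustrative clock speed (not measured on the test machine):

```python
# Cycle budget per packet: if the core were pegged, cycles/packet
# shows how much work the load generator can afford per packet.
# The clock speed below is an illustrative assumption.

CLOCK_HZ = 3.5e9  # assumed CPU clock

def cycles_per_packet(mpps):
    """CPU cycles available per packet at a given rate in Mpps."""
    return CLOCK_HZ / (mpps * 1e6)

print(cycles_per_packet(9))    # ~389 cycles/packet at the observed rate
print(cycles_per_packet(200))  # ~17.5 cycles/packet at the 82599 generator's rate
```

A budget of hundreds of cycles per packet at 9 Mpps makes a CPU bottleneck look unlikely for a tight transmit loop, which is consistent with the mostly-idle-CPU hypothesis, but profiling is the real test.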

Early days... I will keep you posted.

lukego commented 8 years ago

@plajjan Sorry I realize this is turning into an annoying cliff-hanger :-).

I am running a new set of tests now that is like before but run separately for 1..24 parallel send queues. I will post the full results when the tests finish. Quick happy-testing shows we hit at least 60 Mpps with one CPU core doing transmit. I will be curious to see where the sweet-spot combinations of SendQueues+QueueSize+PacketSize land.

This is based on the early rough version of the TX code. I am debugging the new neat version of that in parallel.
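The parameter sweep described above amounts to enumerating a grid over queue count, queue size, and packet size. A sketch of that grid; the specific queue-size and packet-size values here are illustrative placeholders, not the ones used in the report:

```python
# Sketch of the benchmark parameter grid: every combination of
# send-queue count, queue size, and packet size. The size lists
# below are illustrative placeholders.

from itertools import product

SEND_QUEUES = range(1, 25)              # 1..24 parallel send queues
QUEUE_SIZES = (1024, 2048, 4096, 8192)  # descriptor ring sizes (assumed)
PACKET_SIZES = (64, 256, 1024, 1500)    # bytes (assumed)

grid = list(product(SEND_QUEUES, QUEUE_SIZES, PACKET_SIZES))
print(len(grid))  # 24 * 4 * 4 = 384 test configurations
```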

plajjan commented 8 years ago

Heh, no problem. I'll await the test results :)

mwiget commented 8 years ago

@lukego: wow! Very impressive.