luigirizzo / netmap

Automatically exported from code.google.com/p/netmap

ixgbe: strange packet loss when zerocopying packets on the same interface #342

Open vista- opened 7 years ago

vista- commented 7 years ago

I recently developed a small application using netmap, meant to perform VLAN translation. I am using the latest netmap, pulled straight from the repository, and the problem has been present since about February (before that, I have no idea).

The test environment looks something like this on my side:

+-----------+ 3000 +--------+           +--------+
|       eth3|----->|        |   trunk   |        |
|traffic gen|      | Switch | 3000,3001 | netmap |
|       eth4|<-----|        |<--------->|eth3    |
+-----------+ 3001 +--------+           +--------+

The hardware used for this test:

For the traffic generator, I use pkt-gen. I have already confirmed that no (or negligible) loss occurs when using a modified version of both the bridge example application and my own program, which use two interfaces.

However, when I tried to adapt my code to use one single interface and swap packets between its RX and TX rings, I ran into quite a lot of performance problems. I have since been able to get the loss rate down to 0.5-1%, and it seems to be constant: the loss happens even at 100 kpps (3.3% CPU usage) and is just as visible at 1 Mpps, with roughly the same loss rate in both cases. I have no idea how to proceed any further; if you could give me some pointers for debugging this problem, I would really appreciate it.

You can find my application here: https://github.com/vista-/nm_vlan_tr/blob/master/nm_vlan_tr.c I tried to make the application as minimal as possible, while still being easy to test.

During my tests, I compiled the application with gcc-6, and had the best results (0.5-1% loss) with -DIOCTL defined. I used the bridge and pkt-gen applications as a base reference for creating my application.
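
The forwarding core of my program is the usual zero-copy swap of buffer indices between the RX and TX rings, the same idea as in the bridge example. A minimal sketch of that pattern (simplified, with illustrative names; forward_zcopy is not a verbatim excerpt from nm_vlan_tr.c):

#include <stdint.h>
#include <net/netmap_user.h>

/* Zero-copy forwarding between an RX and a TX ring of the same port:
 * swap buffer indices instead of copying payloads (same idea as the
 * bridge example). Returns the number of packets moved. */
static uint32_t forward_zcopy(struct netmap_ring *rxring, struct netmap_ring *txring)
{
    uint32_t batch, i, rx = rxring->cur, tx = txring->cur;

    batch = nm_ring_space(rxring);              /* packets ready to read */
    if (nm_ring_space(txring) < batch)
        batch = nm_ring_space(txring);          /* free TX slots cap the batch */

    for (i = 0; i < batch; i++) {
        struct netmap_slot *rs = &rxring->slot[rx];
        struct netmap_slot *ts = &txring->slot[tx];
        uint32_t idx = ts->buf_idx;

        ts->buf_idx = rs->buf_idx;              /* hand the RX buffer to TX */
        rs->buf_idx = idx;                      /* recycle the old TX buffer */
        ts->len = rs->len;
        ts->flags |= NS_BUF_CHANGED;            /* kernel must re-read buf_idx */
        rs->flags |= NS_BUF_CHANGED;

        rx = nm_ring_next(rxring, rx);
        tx = nm_ring_next(txring, tx);
    }
    rxring->head = rxring->cur = rx;            /* release consumed RX slots */
    txring->head = txring->cur = tx;            /* publish the filled TX slots */
    return batch;
}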

vmaffione commented 7 years ago

I think your program is calling NIOCTXSYNC without updating cur/head in the netmap ring, so that call has no effect (other than wasting CPU cycles).
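
In other words, the head/cur update has to happen before the sync, roughly like this (just a sketch to show the ordering; flush_tx and its arguments are hypothetical):

#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <net/netmap_user.h>

/* Publish the freshly filled TX slots first, then sync; without moving
 * head/cur the kernel sees nothing new to transmit. */
static void flush_tx(int fd, struct netmap_ring *txring, uint32_t first_unused_slot)
{
    txring->head = txring->cur = first_unused_slot;  /* expose the new slots */
    if (ioctl(fd, NIOCTXSYNC, NULL) < 0)             /* now the sync has work to do */
        perror("NIOCTXSYNC");
}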

vista- commented 7 years ago

Good point. I have implemented a fix for this (-DIOCTL_FIX - https://github.com/vista-/nm_vlan_tr/commit/7607c8571990930e1cf93b586e98e5fe5954c17c), but it doesn't seem to help with the packet loss. In fact, I found the loss rate to be a bit worse when calling NIOCTXSYNC only after setting the cur/head pointers.

Compare -DIOCTL:

494.808678 main_thread [2407] 991.460 Kpps [avg/std 992.414 K/1.742 K pps](992.484 Kpkts 508.152 Mbps in 1001033 usec) 58.34 avg_batch 356 min_space
495.809711 main_thread [2407] 990.200 Kpps [avg/std 992.068 K/1.630 K pps](991.224 Kpkts 507.507 Mbps in 1001034 usec) 58.82 avg_batch 376 min_space
496.810744 main_thread [2407] 995.097 Kpps [avg/std 992.153 K/1.747 K pps](996.124 Kpkts 510.015 Mbps in 1001032 usec) 58.87 avg_batch 372 min_space
497.811776 main_thread [2407] 994.691 Kpps [avg/std 992.381 K/1.838 K pps](995.719 Kpkts 509.808 Mbps in 1001033 usec) 58.02 avg_batch 348 min_space
498.812810 main_thread [2407] 994.740 Kpps [avg/std 992.613 K/1.900 K pps](995.769 Kpkts 509.834 Mbps in 1001034 usec) 57.92 avg_batch 348 min_space
499.813844 main_thread [2407] 995.188 Kpps [avg/std 992.862 K/1.975 K pps](996.216 Kpkts 510.063 Mbps in 1001033 usec) 58.19 avg_batch 368 min_space
500.814876 main_thread [2407] 994.968 Kpps [avg/std 993.125 K/1.976 K pps](995.996 Kpkts 509.950 Mbps in 1001033 usec) 57.49 avg_batch 368 min_space

with -DIOCTL_FIX:

238.278821 main_thread [2407] 986.600 Kpps [avg/std 982.692 K/3.088 K pps](987.619 Kpkts 505.661 Mbps in 1001033 usec) 56.43 avg_batch 372 min_space
239.279856 main_thread [2407] 986.470 Kpps [avg/std 983.227 K/3.001 K pps](987.489 Kpkts 505.594 Mbps in 1001033 usec) 56.83 avg_batch 375 min_space
240.280890 main_thread [2407] 985.852 Kpps [avg/std 983.609 K/2.944 K pps](986.873 Kpkts 505.279 Mbps in 1001036 usec) 56.26 avg_batch 384 min_space
241.281922 main_thread [2407] 985.623 Kpps [avg/std 983.926 K/2.888 K pps](986.640 Kpkts 505.160 Mbps in 1001032 usec) 56.61 avg_batch 375 min_space
242.282954 main_thread [2407] 982.438 Kpps [avg/std 983.841 K/2.911 K pps](983.452 Kpkts 503.527 Mbps in 1001032 usec) 56.62 avg_batch 371 min_space
243.283986 main_thread [2407] 983.849 Kpps [avg/std 984.013 K/2.830 K pps](984.864 Kpkts 504.250 Mbps in 1001032 usec) 57.30 avg_batch 371 min_space
244.285019 main_thread [2407] 982.294 Kpps [avg/std 983.934 K/2.860 K pps](983.308 Kpkts 503.454 Mbps in 1001032 usec) 56.40 avg_batch 359 min_space

(pkt-gen was set to generate 1 Mpps, output from the receiver side)

It seems quite weird to me that I can easily do zero-copy forwarding between two interfaces without loss using only polling, yet I have to manually call NIOCTXSYNC when doing the same between rings of a single interface.

vmaffione commented 7 years ago

Actually, poll() calls NIOCTXSYNC internally (see the netmap manual), so you don't have to call NIOCTXSYNC explicitly.

Doing NIOCTXSYNC without updating ring->cur/ring->head is pointless: you're just wasting cycles, so it is equivalent to spinning for 1-2 microseconds, which may possibly have a beneficial effect for you (see below). Your problem may be related to batching: since the packet rate is not very high, it is very likely that you are doing one poll() (and maybe one ioctl()) per packet, which is quite expensive and may introduce the delays that cause packet loss. Try to measure the average batch in your program (e.g. how many packets you receive and forward for each poll()/ioctl()). If it is 1, or in any case too low, try to modify your program so that it works with a larger batch (e.g. by sleeping for a bunch of microseconds).
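
One rough way to measure it (a sketch only; the counters, helper name and reporting interval are arbitrary):

#include <poll.h>
#include <stdio.h>
#include <net/netmap_user.h>

static unsigned long wakeups, rx_pkts;

/* Count how many RX packets are available each time poll() wakes up;
 * rx_pkts / wakeups is the average receive batch. */
static void poll_and_sample(struct pollfd *pfd, struct netmap_ring *rxring)
{
    if (poll(pfd, 1, 1000) > 0) {
        wakeups++;
        rx_pkts += nm_ring_space(rxring);
        if ((wakeups % 100000) == 0)
            printf("avg RX batch: %.2f\n", (double)rx_pkts / wakeups);
    }
}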

vista- commented 7 years ago

Thanks for clearing that up.

I removed the manual ioctl call from my code and performed some measurements of the average number of packets per batch (per poll()), with the following result:

avg per batch (95490 samples): 180.828191 (the rx/tx counts should be the same)

Is that figure still too low?

By the way, I am only polling with POLLOUT set when there is no more space in the TX queue, see here. I have already tried setting POLLOUT when there are fewer than 64, 128 and 256 free slots in the TX queue, but that did not seem to help either.

Am I approaching this problem from the wrong direction?

vmaffione commented 7 years ago

Ah, 180 is very good. So your problem is not batching (unless you are spinning on poll() and not accounting for that in the batch measurement, but I don't think so). POLLOUT should be set only when there is no TX space: if you set it while there is still TX space in the rings, poll() will return immediately, which would again be a waste of cycles.
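
In other words, something like this (a sketch; arm_poll and its arguments are illustrative):

#include <poll.h>
#include <net/netmap_user.h>

/* Always wait for incoming packets; block on TX only when the ring is
 * full, otherwise poll() returns immediately and just burns cycles. */
static void arm_poll(struct pollfd *pfd, int fd, struct netmap_ring *txring)
{
    pfd->fd = fd;
    pfd->events = POLLIN;
    if (nm_ring_space(txring) == 0)
        pfd->events |= POLLOUT;
}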

What is the CPU utilization of your program?

vista- commented 7 years ago

At 1 Mpps, the CPU utilization is about 22-23%.

vmaffione commented 7 years ago

Which means that you don't have any limitation in terms of CPU, as expected. Maybe your NIC interrupt moderation settings are too high? (e.g. too few interrupts, the batch is too big, and you lose packets.) You may try to configure your system to work at a smaller batch than 180 (on both receive and transmit).

vista- commented 7 years ago

I have made a bunch of tests using varying packet limits per ring (by setting the limit variable), here are the results: https://gist.github.com/vista-/3465211923e2d051aa4b10846cfa7c30

vmaffione commented 7 years ago

I don't expect any improvement from setting the limit variable. In this way you are just increasing the number of poll() syscalls, but you are not changing the receive batch (i.e. how many RX packets are ready each time poll() wakes up). You may manage to reduce the batch by playing with "ethtool -C".

Anyway, it seems your issue is related to delays: you are not able to react quickly enough to avoid queue overruns. This may be due to C-states, i.e. your CPU enters a deep C-state when poll() sleeps. Exiting the C-state takes time, so you may try disabling C-states on your machine.

In any case, if you want to debug further, you first of all need to understand where the packets are dropped, e.g. whether they are dropped because there is no more space in the RX ring, or because there is no more space in the TX ring, etc.
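
For example, a couple of counters sampled around the forwarding loop can already tell you which ring is the bottleneck (a sketch; the counter names are made up):

#include <stdint.h>
#include <net/netmap_user.h>

static unsigned long rx_seen, tx_full_with_backlog;

/* Called once per poll() wakeup, before forwarding: if packets are
 * waiting on RX while the TX ring has no free slots, that is where you
 * are forced to drop (or to stall and let the RX ring overflow). */
static void sample_bottleneck(struct netmap_ring *rxring, struct netmap_ring *txring)
{
    uint32_t rx = nm_ring_space(rxring);

    rx_seen += rx;
    if (rx > 0 && nm_ring_space(txring) == 0)
        tx_full_with_backlog++;
}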

vista- commented 7 years ago

As far as I can see, the packets are not dropped in the RX ring: the "pps" count, which is tied to the receipt of packets, matches the rate set on the traffic generator.

Disabling C-states did improve and stabilize performance, but I'd like to note that even in previous tests, the governor was already set to performance for all cores and the netmap program was pinned to a core using taskset.

I did mess around with the hardware ring buffer sizes using ethtool -G, increasing them from the default 512 first to 1024, then to 2048. Are these settings respected by netmap / do they affect its inner workings?

It might be placebo, but after setting both the RX and TX hardware ring buffers to 2048 with ethtool -G, the following figures seem to be stable... With the 1024 setting, the ~998.5 kpps figure holds for about 15-20 seconds, then plummets to the previously seen values.

The current output of the receiving side of pkt-gen looks like this:

664.947175 main_thread [2407] 998.221 Kpps [avg/std 998.399 K/78.968 pps](999.252 Kpkts 511.617 Mbps in 1001033 usec) 55.84 avg_batch 384 min_space
665.948208 main_thread [2407] 998.493 Kpps [avg/std 998.400 K/80.773 pps](999.524 Kpkts 511.756 Mbps in 1001033 usec) 56.13 avg_batch 376 min_space
666.949241 main_thread [2407] 998.465 Kpps [avg/std 998.408 K/81.316 pps](999.496 Kpkts 511.742 Mbps in 1001033 usec) 55.83 avg_batch 389 min_space
667.950273 main_thread [2407] 998.482 Kpps [avg/std 998.414 K/83.253 pps](999.512 Kpkts 511.750 Mbps in 1001032 usec) 54.95 avg_batch 375 min_space
668.951307 main_thread [2407] 998.491 Kpps [avg/std 998.416 K/84.818 pps](999.523 Kpkts 511.756 Mbps in 1001034 usec) 55.43 avg_batch 376 min_space

The coalescing settings were already set to a minimum during previous tests; the current settings are the defaults:

Coalesce parameters for eth4:
Adaptive RX: off  TX: off
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0

rx-usecs: 1
rx-frames: 0
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 0
tx-frames: 0
tx-usecs-irq: 0
tx-frames-irq: 128

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0

Lowering rx_usecs to 0 did not seem to help.

Regarding packet drops - does the rx_missed_errors counter in the ethtool -S output correctly reflect dropped RX packets? If so, then with the aforementioned settings this value stays the same.

vmaffione commented 7 years ago

Ok, so at this point you should count packets at the four points you have: generator, receiver (RX ring), transmitter (TX ring), and external receiver; send a few million packets and see where you lose them. Increasing the ring size does what you would expect: you get more buffering, which may be beneficial if your problem is due to shallow buffers.

So rx_usecs seems ok. Have you tried to lower tx-frames-irq?

vista- commented 7 years ago

I have tried lowering the tx-frames-irq value to 0, but that did not seem to help.

I will try adding the counters in the following days - is the built-in counter for pkt-gen not precise enough? The counter inside my own application is incremented every time the "packet loop" is traversed.

I can't think of a way at the moment to precisely count the packets actually RX'd/TX'd - are the figures given by ethtool good enough, or do I have to rely on the values given to me by the netmap API (by nm_ring_space()?).

vmaffione commented 7 years ago

The pkt-gen counter should be precise, and anyway you can make small modifications to it if needed. You can use the pkt-gen -n 10000000 option to generate exactly 10000000 packets. Then you can count how many you receive on the other pkt-gen instance, and how many you process in your program (e.g. how many times execution passes through this point: https://github.com/vista-/nm_vlan_tr/blob/master/nm_vlan_tr.c#L181).

I see that you are dropping packets here: https://github.com/vista-/nm_vlan_tr/blob/master/nm_vlan_tr.c#L177, which could explain your numbers.

vista- commented 7 years ago

I already did count the number of iterations in my application, but I have added counters to various other parts of the loop (including line 177).

I got the following total counts when I sent 10 million packets, and I don't really see any packet drops inside my application:

pkt-gen (tx):
Sent 10000000 packets 640000000 bytes 39063 events 64 bytes each in 10.00 seconds.

nm_vlan_tr:
enter loop: 10000005 | before rewriting: 10000005 | after rewriting: 10000000 | end loop: 10000000 pps

pkt-gen (rx):
Received 9968444 packets 637980416 bytes 194159 events 64 bytes each in 16.01 seconds.

The 5 extra packets were received before starting the test, most likely CDP/LLC packets.

vmaffione commented 7 years ago

Maybe pkt-gen on the receive side is wrong. You could avoid running pkt-gen(rx) and just rely on the interface counters (ethtool -S) to count. Have you disabled flow control on the NICs? Also note that doing a txsync() does not mean the packets have been transmitted, only that they have been enqueued. You know for sure that packets have been transmitted when ring->tail advances past them on the TX ring.

I think you need to figure out whether the missing packets are dropped on the TX side of nm_vlan_tr, on the RX ring (pkt-gen RX), or on the link between the two.
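
To count actual transmissions you can watch how tail advances, e.g. (a sketch; the counters are made up):

#include <stdint.h>
#include <net/netmap_user.h>

static uint32_t last_tail;          /* tail position seen at the previous check */
static unsigned long tx_completed;  /* slots actually consumed by the NIC */

/* Call after each txsync/poll: the slots between the old and the new
 * tail have been handed to the wire by the hardware. */
static void account_tx_completions(struct netmap_ring *txring)
{
    uint32_t tail = txring->tail;

    tx_completed += (tail >= last_tail)
            ? tail - last_tail
            : tail + txring->num_slots - last_tail;
    last_tail = tail;
}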

Fr3DBr commented 4 years ago

I'm trying to track down the same problem: a consistent 0.5%-1.0% packet loss.

My application is basically a bridge between the NIC and the host stack, used to intercept a few packets in between.