Radula76 opened 1 year ago
UE IP: 192.168.128.15
Hi @Radula76
Try turning off GRO on your interfaces:
ethtool -K eth0 gro off
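For example (eth0 is just a placeholder for whichever interfaces you capture on), something like this shows the current offload settings and turns off the common culprits:
# check the current offload settings on an interface
ethtool -k eth0 | grep -E 'segmentation-offload|receive-offload'
# turn off the usual suspects if any show as "on"
ethtool -K eth0 gro off gso off tso off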
We have set "generic-receive-offload: off" on both the N2/N3 and N6 interfaces with no difference in throughput. TCP is still 2 or 3 Mbit/s compared to UDP. I can re-run all the traces again, but the behaviour looks exactly the same.
I suspect that end-to-end path MTU discovery is not happening. Can you try setting the MTU on N6 to a lower number, say 1100, and re-testing?
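Something along these lines should do it (eth1 here is only a stand-in for your N6 interface name):
# temporarily drop the MTU on the N6/SGi interface
ip link set dev eth1 mtu 1100
# confirm the change
ip link show dev eth1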
We can see in the N6 trace that packet 194, the TCP SYN to the iperf server, has MTU 1400, and the SYN/ACK return packet 195 has MTU 1460. The main problem seems to be that the iPerf server is merging packets together to meet the requested throughput / catch up with the payload when the ACKs are slow. I would expect the N6 physical interface to correctly split these back up before sending them into the AGW gtp_br0/OVS. We don't really want OVS to be fragmenting at all, right?
We had better performance on LTE with 1300; I can re-run the tests with our N6 interface set specifically to 1300.
Merging packets is usually an indication that there is some kind of offload running along the path. This usually happens on TCP streams. Can you try mirroring the ports and check where exactly the packets are getting fused, by comparing the capture on Linux (tcpdump) vs the mirrored captures?
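Roughly like this (interface and file names are only examples):
# capture on the N6 interface on the AGW itself
tcpdump -i eth1 -s 0 -w n6_host.pcap
# capture the mirrored traffic on a separate machine attached to the mirror port
tcpdump -i enp2s0 -s 0 -w n6_mirror.pcap
# packets bigger than the wire MTU that appear only in the host capture
# point to merging in the host's receive path (GRO/LRO) rather than on the wire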
New traces added with the N6 interface set to MTU 1300; the behaviour looks exactly the same. The MTU of the Teltonika is 1400, so I think that is what is driving the end-to-end MTU?
N2/N3: SYN (packet 43) MTU 1460, SYN/ACK 1400
GTPU: SYN (packet 1) MTU 1260, SYN/ACK 1400
gtp_br0: SYN (packet 128) MTU 1460, SYN/ACK 1400
N6: SYN (packet 408) MTU 1400, SYN/ACK (packet 409) 1460
So it looks like gtpu_sys_2152 (or OVS again) is fragmenting down to 1300? We can see that on the N2/N3 downlink we have IP>UDP>GTP>IP>TCP packets at 1498 bytes (1400-byte payload).
I don't see any extra-large merged/fused packets on the N6 interface now; that's where we were seeing them in the previous trace. The packets are just a bit big for that MTU on gtpu_sys now?
Thanks for the captures. In the handshake, I see that the MSS is set to 1460. The next best thing to try is MSS clamping on either end.
Thanks @bhuvaneshne, I've tried MSS clamping on the N6 interface, but in reality we won't be able to clamp the iperf servers if they are on the internet? Where would you suggest MSS clamping? Should we try the Teltonika CPE at 1300 first, then N2/N3 and N6? What's going on with the gtpu and gtp_br0 interfaces and OVS?
Yes, start from the CPE. If possible, launch a local iperf server for your experiments and clamp MSS there too. I am not sure whether OVS will fragment the packets and send out more than one GTP packet for every incoming "big" packet (your question: is this fragmentation managed in pipelined or openvswitch?); someone from the community could help you here.
OK, we have an MTU override in the Teltonika set to 1300. I noticed we don't send an MTU size in the PDU session setup? The N6 SYN/ACK from the iperf server is still at 1460 (packet 639), but the TCP payload is 1260. This seems fine on gtp_br0 (packets from 241 on), but as soon as it hits gtpu_sys_2152 (packets 234 onwards) it's already showing 'previous segment not seen' and 'Out of Order'. Is this downlink payload being messed up in OVS?
It doesn't look like fragmentation so much; is OVS offloading packets to different CPU cores which are processing things at different speeds? TT1300.tar.gz
This is an interesting angle. I was thinking that packets can be reordered at the destination as long as they are within the window.
Just to confirm: Did you use iperf options to set the MSS and length of buffers for your experiments?
PS: These are from iperf3
No, it was a standard iperf -c 192.168.2.101 -b 20M -R, adding -u for the UDP test.
If there are no other leads for debugging, try with iperf3, setting lower values for the MSS and buffer length.
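For example (the server address and the MSS/length values are only a starting point):
# on the test server
iperf3 -s
# on the client: clamp the TCP MSS and use a smaller buffer length
iperf3 -c 192.168.2.101 -M 1200 -l 1200 -b 20M -t 2 -R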
I think we are at a pretty low value already. I can't see fragmentation in the last trace. It's the downlink gtp_br0 -> OVS -> gtpu_sys_2152 that is out of order and missing previous segments.
We are running the AGW on a bare-metal AMD Ryzen 5 5600G, 6 cores/12 threads. We can see all 12 threads are carrying Open vSwitch instances.
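This is roughly how we are checking which cores the OVS threads land on (just a quick look, not conclusive):
# list ovs-vswitchd threads and the CPU each one last ran on
ps -T -o tid,psr,pcpu,comm -p $(pidof ovs-vswitchd)
# or watch per-thread CPU usage live
top -H -p $(pidof ovs-vswitchd)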
We are running new traces and testing for this issue.
Hi @Radula76, did you find a solution for this issue?
Hi @bhuvaneshne, there are several factors to this case.
For fragmentation we need to look at the MTU across the whole network and at MSS clamping, which helps in lowering the MTU.
If you are running a NAT AGW, then the MSS clamping is done on the SGi side. We need to be aware of the GTP headers (up to 50 bytes) and set the UE/CPE MTUs to 1450 or lower. MSS clamping will then tell the SGi/internet-side routers that we only support an MTU of 1450, so we don't get a 1500-byte packet and then have to fragment it when we try to put the GTP headers on.
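As a sketch of what that clamp can look like on the SGi side (eth1 is a placeholder interface name; with a 1450-byte MTU the TCP MSS works out to 1410):
# clamp the MSS on forwarded TCP SYNs leaving the SGi interface
iptables -t mangle -A FORWARD -o eth1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1410
# or derive it from the path MTU instead of hard-coding a value
iptables -t mangle -A FORWARD -o eth1 -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu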
The TCP out-of-order looks to be OVS changing the packet size in the header, even though the packet remains small. It shows up in Wireshark as an error, but traffic is generally flowing OK when we measure on the GTP link or the internet link.
Fragmentation will kill the throughput, but the TCP out-of-order doesn't seem to affect it that much.
We are looking forward to the new data plane changes from Wavelabs. We are also running a lot more of our AGWs in non-NAT mode. We have found that external CGNATs have faster packet handling than the AGW's software-based NAT.
Thanks @Radula76
Your Environment
Describe the Issue
South African customers complain of 'poor performance' with TCP; they see much better performance with UDP.
On our test system here in Melbourne, UDP downlink payload runs at 20 Mbps while TCP downlink payload runs at 2 to 4 Mbps. We have run a trace on the 4 internal interfaces of our Magma 1.8 AGW. This is a 5G SA deployment with BTI Wireless 5G and a Teltonika RUTX50 CPE, with the wireless device running an iPerf payload to a server on the same switch directly connected to the AGW. We can see the MTU get negotiated to 1400, but during the TCP session the iPerf server starts stacking packets together, creating really large packets that hit the SGi/N6 interface. These get passed through to gtp_br0, and it looks like OVS is fragmenting them out of order towards gtpu_sys_2152 and then downlink to the S1/N2/N3 interfaces, causing all sorts of missed ACKs and retransmissions.
We ran a similar TCP iPerf session on Magma 1.7, an LTE deployment with BTI Wireless 4G and a Samsung J8 mobile phone. This negotiates the MTU to 1300. We see the iPerf server still stack packets to extra-large sizes, but they are fragmented on the downlink towards the UE between SGi and gtp_br0. The throughput is much better, within 90% of UDP.
High-level diagram attached showing the tcpdump points. The 'internet' here is a 1 Gbps switch with the iPerf server running off it as well.
TCP stream statistics show some latency spikes; with those we would expect TCP to be around 80% of the UDP speeds, but not 10%. Really it is the out-of-order fragmentation and retransmissions that are crushing the performance of the link.
To Reproduce
iPerf TCP traffic near max throughput: iperf -c 192.168.2.101 -R -t 2 -b 20M
iPerf UDP traffic: iperf -c 192.168.2.101 -R -t 2 -b 20M -u
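A rough sketch of how we capture the 4 AGW interfaces while those runs happen (interface names are examples only; the physical N6/S1 names will differ per install):
# on the AGW: capture the 4 internal interfaces for ~10 seconds
for intf in eth1 eth2 gtp_br0 gtpu_sys_2152; do
  timeout 10 tcpdump -i "$intf" -s 0 -w "/tmp/${intf}.pcap" &
done
# run the TCP and UDP iperf commands above from the UE-side device
# while the captures are running, then wait for them to finish
wait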
TCP traffic is 10% of UDP speeds or worse.
Expected behavior
some questions:
Expected behavior is for Magma 1.8 to match Magma 1.7: TCP performance similar to UDP.
Screenshots
network diagram
Additional context
OVS version on Magma 1.8 5G: openvswitch-switch/stable-4.0.0,now 2.15.4-9-magma amd64 [installed]
OVS version on Magma 1.7 LTE: openvswitch-switch/focal-1.7.0,now 2.15.4-8 amd64 [installed,automatic]
Attached is a trace file of the 4 interfaces: 2 seconds of TCP iPerf followed by 2 seconds of UDP traffic.
UE IP:
AGW external IP: 192.168.2.231
iperf server IP: 192.168.2.101
TCP_UDPtrace.tar.gz
We also have the 100 Mb LTE trace showing the Magma 1.7 behavior of SGi/gtp_br0, if required.
We have another trace where we tried to MSS-clamp the SGi interface to 1400.
The test system is available for any other logs, re-tests, or procedures.
thanks!