nestybox / sysbox

An open-source, next-generation "runc" that empowers rootless containers to run workloads such as Systemd, Docker, Kubernetes, just like VMs.
Apache License 2.0

Sysbox integration-tests failures due to interface's MTU mismatch #36

Closed rodnymolina closed 4 years ago

rodnymolina commented 4 years ago

The problem affects many of Sysbox's integration testcases where large packet sizes need to be exchanged (e.g. docker pull). At this point it is not clear whether the problem can be reproduced in regular system containers (L1) and/or their child app containers.

At a high level, the problem's symptom is very obvious: network transactions initiated by an L2 sys-container (within Sysbox's privileged test container) that require the exchange of large packets can stall during "docker pull" execution.

The following elements are required for the problem to reproduce:

From a topological perspective, this is the path to be traversed by every packet generated within the L2 container:

data-center-fabric <--> host's egress-iface (A) <--> host's docker0 iface (B) <--> L1's egress-iface (C) <--> L1's docker0 iface (D)

Notice that the MTU along this path is always 1500 bytes, except on the last network element from the L2 container's perspective, the host's egress interface (A), whose MTU is 1460 bytes.
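To confirm the MTU at each hop, the interfaces can be queried directly; a quick sketch (the interface names eth0/docker0 and the placeholder container name are assumptions that depend on the environment):

# On the host: egress interface (A) and docker0 bridge (B)
ip link show eth0 | grep -o 'mtu [0-9]*'       # 1460 in this setup
ip link show docker0 | grep -o 'mtu [0-9]*'    # 1500 by default

# Inside the L1 sys-container: its egress interface (C) and its docker0 bridge (D)
docker exec <l1-container> ip link show eth0 | grep -o 'mtu [0-9]*'
docker exec <l1-container> ip link show docker0 | grep -o 'mtu [0-9]*'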

As per the IETF Path MTU Discovery (PMTUD) specs, upon arrival of a datagram larger than the egress interface's MTU, if the "DF" bit is set in its L3 header, the packet must be dropped and an ICMP "Fragmentation Needed" message must be sent back to hint to the origin that it needs to reduce the packet size in subsequent attempts. The network stack at the origin should then create an entry to keep track of the discovered MTU value associated with the remote IP peer. At the same time, the network stack must notify the application of the need to adjust its PDU size. At that point the application typically opts to reduce the size of subsequent packets so that they can reach the remote end.
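For reference, this mechanism can be exercised by hand with DF-marked probes of a known size; a minimal sketch (the peer address is just an example):

# 1432 bytes of ICMP payload + 8 (ICMP) + 20 (IP) = 1460 bytes on the wire; DF set, so no fragmentation allowed
ping -M do -s 1432 8.8.8.8

# tracepath discovers the path MTU hop by hop using the same principle
tracepath 8.8.8.8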

In our scenario, we expect interface B (host's docker0) to be the spot where the ICMP fragmentation-needed messages are generated, as this is the location where the MTU discrepancy sits. That's precisely what we observe during problem reproduction:

16:24:11.481123 IP 172.17.0.1 > 172.18.0.6: ICMP ec2-52-5-11-128.compute-1.amazonaws.com unreachable - need to frag (mtu 1460), length 556
16:24:11.551969 IP ec2-52-5-11-128.compute-1.amazonaws.com.https > 172.18.0.6.55310: Flags [.], ack 1587, win 119, options [nop,nop,TS val 4082793925 ecr 1493915319], length 0
16:24:11.632656 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [P.], seq 3035:3104, ack 5326, win 501, options [nop,nop,TS val 1493915471 ecr 4082793925], length 69
16:24:11.662150 IP ec2-52-5-11-128.compute-1.amazonaws.com.https > 172.18.0.6.55310: Flags [.], ack 1587, win 128, options [nop,nop,TS val 4082794035 ecr 1493915319,nop,nop,sack 1 {3035:3104}], length 0
16:24:11.662203 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [.], seq 1587:3035, ack 5326, win 501, options [nop,nop,TS val 1493915500 ecr 4082794035], length 1448
16:24:11.662249 IP 172.17.0.1 > 172.18.0.6: ICMP ec2-52-5-11-128.compute-1.amazonaws.com unreachable - need to frag (mtu 1460), length 556
16:24:11.900648 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [.], seq 1587:3035, ack 5326, win 501, options [nop,nop,TS val 1493915739 ecr 4082794035], length 1448
16:24:11.900764 IP 172.17.0.1 > 172.18.0.6: ICMP ec2-52-5-11-128.compute-1.amazonaws.com unreachable - need to frag (mtu 1460), length 556
16:24:12.392772 IP 172.18.0.6.55310 > ec2-52-5-11-128.compute-1.amazonaws.com.https: Flags [.], seq 1587:3035, ack 5326, win 501, options [nop,nop,TS val 1493916231 ecr 4082794035], length 144

However, we don't see the source application ("docker pull" in this case) reacting to these messages, so it continues generating packets of the same size, and communication ultimately stalls.
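One quick way to tell whether the sender's stack ever processed the ICMP hint is to check its route cache, which is where the learned PMTU should show up; a diagnostic sketch (the destination is a placeholder):

# Inside the container originating the transfer
ip route get <registry-ip>     # route actually being used
ip route show cache            # a learned PMTU shows up as "... cache ... mtu 1460"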

rodnymolina commented 4 years ago

After further research, I have been able to verify that the problem does not affect Sysbox's regular scenario. The problem is definitely impacting Sysbox's integration framework, but not the production environment.

What follows is a summary of the actions executed and the results obtained (a rough command sketch of this setup appears after the outputs below):

Create a sys-container (L1) with dockerd and sshd. Verify that both "docker pull" and ssh transfers work as expected. Also confirm that the PMTU process kicks in and a route entry (with MTU information) is generated by the kernel's network stack.

Create a privileged L2 container from the docker:dind image inside the existing L1 sys-container. Verify that the "docker pull" operation works fine within the L2 container and that the PMTU mechanism works as expected.

Create a regular L2 container inside the existing L1 sys-container. Install "ssh" and verify that the PMTU algorithm works as expected when uploading large files from this L2 container. See sample output below ...

<-- tcpdump output:

07:40:42.538649 IP 172.18.0.2.22 > 24.7.53.252.55977: Flags [P.], seq 1041:3937, ack 1124, win 501, options [nop,nop,TS val 3599766946 ecr 2204038159], length 2896
07:40:42.538730 IP 172.17.0.1 > 172.18.0.2: ICMP 24.7.53.252 unreachable - need to frag (mtu 1460), length 556
07:40:42.538747 IP 172.18.0.2.22 > 24.7.53.252.55977: Flags [P.], seq 3937:6833, ack 1124, win 501, options [nop,nop,TS val 3599766946 ecr 2204038159], length 2896
07:40:42.538767 IP 172.17.0.1 > 172.18.0.2: ICMP 24.7.53.252 unreachable - need to frag (mtu 1460), length 556

<-- ip route entry

root@dd6272c17126:/# ip route show cache
24.7.53.252 via 172.18.0.1 dev eth0
    cache expires 584sec mtu 1460
root@dd6272c17126:/#
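For completeness, the L1/L2 setup exercised in the steps above can be approximated with commands along these lines (a rough sketch, not the actual test harness; the image names and container names are assumptions):

# L1: system container running its own dockerd (sysbox runtime)
docker run --runtime=sysbox-runc -d --name l1 nestybox/ubuntu-bionic-systemd-docker

# L2 privileged container (docker:dind) created by the dockerd inside L1
docker exec l1 docker run --privileged -d --name l2-dind docker:dind

# L2 regular container created by the dockerd inside L1
docker exec l1 docker run -d --name l2 ubuntu sleep infinity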

rodnymolina commented 4 years ago

To prevent MTU-related issues like this one, we can simply ensure that the docker0 interface is configured with an MTU value that matches that of the egress interface.
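With Docker, the bridge MTU can be pinned through the daemon configuration; a sketch of what that could look like for this setup (1460 matches the egress interface here):

# /etc/docker/daemon.json inside the test container / L1
{
  "mtu": 1460
}

# or, equivalently, on the daemon command line
dockerd --mtu=1460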

The fix for this one will need to update the test container's initialization script as well as the Sysbox installer.
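In the initialization script this could boil down to deriving the egress MTU and feeding it to dockerd before it starts; a hypothetical sketch (the default-route detection and file path are assumptions, not the actual script):

# Interface holding the default route, and its MTU
egress_iface=$(ip route show default | awk '{print $5; exit}')
egress_mtu=$(cat /sys/class/net/${egress_iface}/mtu)

# Make dockerd's bridge use the same MTU
mkdir -p /etc/docker
echo "{ \"mtu\": ${egress_mtu} }" > /etc/docker/daemon.json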

rodnymolina commented 4 years ago

Closing now.