google / gvisor

Application Kernel for Containers
https://gvisor.dev
Apache License 2.0

ACK lost when GSO enabled #11198

Open GerardGarcia opened 2 days ago

GerardGarcia commented 2 days ago

Description

ACKs do not appear to be processed by the gVisor netstack when the packet is big enough to be fragmented somewhere down the network stack. This causes the TCP connection to misbehave: the client retransmits and the server sends duplicate ACKs. If GSO is disabled (--gso=false) or the whole gVisor network stack is bypassed (--network=host), the connection works as expected. I attach a few network dumps:

At the gVisor sandboxed container veth: Image

At gVisor (--pcap-log) Image

Outside the gVisor sandboxed container veth: Image

My interpretation is that netstack never sees the ACKs at packets 11/12, which causes the retransmissions and duplicate ACKs.
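
To check this kind of interpretation against a capture without reading it by eye, a small script can flag repeated data segments and repeated pure ACKs. This is only a sketch: the `Segment` record and `find_anomalies` helper are hypothetical, the input is assumed to be a hand-extracted list of (sender, seq, ack, payload length) tuples rather than a real pcap, and the heuristics are far simpler than what Wireshark applies.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """Hypothetical, simplified view of one TCP segment from a capture."""
    src: str      # sender label, e.g. "client" or "server"
    seq: int      # sequence number
    ack: int      # acknowledgment number
    length: int   # TCP payload bytes (0 for a pure ACK)


def find_anomalies(segments: List[Segment]) -> Tuple[List[int], List[int]]:
    """Return (retransmission indexes, duplicate-ACK indexes)."""
    seen_data = set()   # (src, seq, length) of data segments already seen
    last_ack = {}       # src -> ack number of that sender's last pure ACK
    retransmissions, dup_acks = [], []
    for i, s in enumerate(segments):
        if s.length > 0:
            key = (s.src, s.seq, s.length)
            if key in seen_data:
                retransmissions.append(i)   # same bytes sent again
            seen_data.add(key)
        else:
            if last_ack.get(s.src) == s.ack:
                dup_acks.append(i)          # same ACK number repeated
            last_ack[s.src] = s.ack
    return retransmissions, dup_acks
```

Feeding it the segments from each of the three capture points would show at which point the ACKs stop matching up.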

Steps to reproduce

In our environment it is straightforward to reproduce: just send a request with a large payload, for example with curl:

curl -XPOST http://httpbin.org/post -d @req_large.json

If the request is smaller (payload less than 1420B), everything works as expected.
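
For anyone trying to reproduce this, payloads of a known size on either side of the 1420B threshold can be generated with a short script. This is a minimal sketch: the helper names and the `{"data": ...}` body shape are assumptions for illustration, not the contents of the original req_large.json.

```python
import json


def make_payload(target_bytes: int) -> bytes:
    """Return a JSON document whose encoded size is exactly target_bytes."""
    # Size of the JSON wrapper around an empty string value.
    overhead = len(json.dumps({"data": ""}).encode("ascii"))
    if target_bytes < overhead:
        raise ValueError(f"target must be at least {overhead} bytes")
    filler = "a" * (target_bytes - overhead)
    return json.dumps({"data": filler}).encode("ascii")


def write_payload(path: str, target_bytes: int) -> None:
    """Write a payload of the requested size to path (hypothetical helper)."""
    with open(path, "wb") as f:
        f.write(make_payload(target_bytes))
```

Calling `write_payload("req_small.json", 1000)` and `write_payload("req_large.json", 4000)` gives two files that can be passed to the curl command above via `-d @<file>`; the two sizes are arbitrary picks below and above the reported threshold.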

runsc version

runsc version release-20241104.0-53-g03a28d158e54
spec: 1.1.0-rc.1

(it happens with other stable versions but I was testing with the latest nightly)

docker version (if using docker)

uname

Linux (...) 5.15.166-111.163.amzn2.x86_64 #1 SMP Fri Sep 6 21:31:40 UTC 2024 x86_64 GNU/Linux

kubectl (if using Kubernetes)

Server Version: v1.29.8-eks-a737599

We are running gVisor sandboxes within a pod, not using gVisor sandboxes to wrap k8s pods.

repo state (if built from source)

No response

runsc debug logs (if available)

I1120 10:07:10.190652  336244 main.go:195] **************** gVisor ****************
I1120 10:07:10.190684  336244 main.go:196] Version release-20241104.0-53-g03a28d158e54, go1.23.2 X:nocoverageredesign, amd64, 8 CPUs, linux, PID 336244, PPID 336186, UID 0, GID 0
D1120 10:07:10.190690  336244 main.go:197] Page size: 0x1000 (4096 bytes)
I1120 10:07:10.190696  336244 main.go:198] Args: [runsc --debug-log=/tmp/runsc/ --debug --strace --log-packets --pcap-log /tmp/runsc/2_gvisor.pcap start 88a346f8-98cf-4fcb-bebe-b14aa84d60dc]
I1120 10:07:10.190705  336244 config.go:439] Platform: systrap
I1120 10:07:10.190721  336244 config.go:440] RootDir: /var/run/runsc
I1120 10:07:10.190725  336244 config.go:441] FileAccess: exclusive / Directfs: true / Overlay: root:self
I1120 10:07:10.190731  336244 config.go:442] Network: sandbox
I1120 10:07:10.190735  336244 config.go:444] Debug: true. Strace: true, max size: 1024, syscalls: 
D1120 10:07:10.190740  336244 config.go:462] Config.RootDir (--root): /var/run/runsc
D1120 10:07:10.190748  336244 config.go:462] Config.Traceback (--traceback): system
D1120 10:07:10.190752  336244 config.go:462] Config.Debug (--debug): true
D1120 10:07:10.190756  336244 config.go:462] Config.LogFilename (--log): (empty)
D1120 10:07:10.190758  336244 config.go:462] Config.LogFormat (--log-format): text
D1120 10:07:10.190761  336244 config.go:462] Config.DebugLog (--debug-log): /tmp/runsc/
D1120 10:07:10.190764  336244 config.go:462] Config.DebugToUserLog (--debug-to-user-log): false
D1120 10:07:10.190767  336244 config.go:462] Config.DebugCommand (--debug-command): (empty)
D1120 10:07:10.190769  336244 config.go:462] Config.PanicLog (--panic-log): (empty)
D1120 10:07:10.190772  336244 config.go:462] Config.CoverageReport (--coverage-report): (empty)
D1120 10:07:10.190774  336244 config.go:462] Config.DebugLogFormat (--debug-log-format): text
D1120 10:07:10.190777  336244 config.go:462] Config.FileAccess (--file-access): exclusive
D1120 10:07:10.190780  336244 config.go:462] Config.FileAccessMounts (--file-access-mounts): shared
D1120 10:07:10.190783  336244 config.go:462] Config.Overlay (--overlay): false
D1120 10:07:10.190785  336244 config.go:462] Config.Overlay2 (--overlay2): root:self
D1120 10:07:10.190788  336244 config.go:462] Config.FSGoferHostUDS (--fsgofer-host-uds): false
D1120 10:07:10.190791  336244 config.go:462] Config.HostUDS (--host-uds): none
D1120 10:07:10.190794  336244 config.go:462] Config.HostFifo (--host-fifo): none
D1120 10:07:10.190799  336244 config.go:462] Config.HostSettings (--host-settings): check
D1120 10:07:10.190803  336244 config.go:462] Config.Network (--network): sandbox
D1120 10:07:10.190805  336244 config.go:462] Config.EnableRaw (--net-raw): false
D1120 10:07:10.190808  336244 config.go:462] Config.AllowPacketEndpointWrite (--TESTONLY-allow-packet-endpoint-write): false
D1120 10:07:10.190811  336244 config.go:462] Config.HostGSO (--gso): true
D1120 10:07:10.190813  336244 config.go:462] Config.GVisorGSO (--software-gso): true
D1120 10:07:10.190818  336244 config.go:462] Config.GVisorGRO (--gvisor-gro): false
D1120 10:07:10.190824  336244 config.go:462] Config.TXChecksumOffload (--tx-checksum-offload): false
D1120 10:07:10.190826  336244 config.go:462] Config.RXChecksumOffload (--rx-checksum-offload): true
D1120 10:07:10.190829  336244 config.go:462] Config.QDisc (--qdisc): fifo
D1120 10:07:10.190838  336244 config.go:462] Config.LogPackets (--log-packets): true
D1120 10:07:10.190841  336244 config.go:462] Config.PCAP (--pcap-log): /tmp/runsc/2_gvisor.pcap
D1120 10:07:10.190843  336244 config.go:462] Config.Platform (--platform): systrap
D1120 10:07:10.190846  336244 config.go:462] Config.PlatformDevicePath (--platform_device_path): (empty)
D1120 10:07:10.190848  336244 config.go:462] Config.MetricServer (--metric-server): (empty)
D1120 10:07:10.190851  336244 config.go:462] Config.FinalMetricsLog (--final-metrics-log): (empty)
D1120 10:07:10.190853  336244 config.go:462] Config.ProfilingMetrics (--profiling-metrics): (empty)
D1120 10:07:10.190856  336244 config.go:462] Config.ProfilingMetricsLog (--profiling-metrics-log): (empty)
D1120 10:07:10.190862  336244 config.go:462] Config.ProfilingMetricsRate (--profiling-metrics-rate-us): 1000
D1120 10:07:10.190864  336244 config.go:462] Config.Strace (--strace): true
D1120 10:07:10.190867  336244 config.go:462] Config.StraceSyscalls (--strace-syscalls): (empty)
D1120 10:07:10.190869  336244 config.go:462] Config.StraceLogSize (--strace-log-size): 1024
D1120 10:07:10.190872  336244 config.go:462] Config.StraceEvent (--strace-event): false
D1120 10:07:10.190874  336244 config.go:464] Config.DisableSeccomp: false
D1120 10:07:10.190879  336244 config.go:462] Config.EnableCoreTags (--enable-core-tags): false
D1120 10:07:10.190882  336244 config.go:462] Config.WatchdogAction (--watchdog-action): logWarning
D1120 10:07:10.190886  336244 config.go:462] Config.PanicSignal (--panic-signal): -1
D1120 10:07:10.190889  336244 config.go:462] Config.ProfileEnable (--profile): false
D1120 10:07:10.190891  336244 config.go:462] Config.ProfileBlock (--profile-block): (empty)
D1120 10:07:10.190893  336244 config.go:462] Config.ProfileCPU (--profile-cpu): (empty)
D1120 10:07:10.190896  336244 config.go:462] Config.ProfileHeap (--profile-heap): (empty)
D1120 10:07:10.190898  336244 config.go:462] Config.ProfileMutex (--profile-mutex): (empty)
D1120 10:07:10.190900  336244 config.go:462] Config.TraceFile (--trace): (empty)
D1120 10:07:10.190903  336244 config.go:462] Config.NumNetworkChannels (--num-network-channels): 1
D1120 10:07:10.190906  336244 config.go:462] Config.NetworkProcessorsPerChannel (--network-processors-per-channel): 0
D1120 10:07:10.190908  336244 config.go:462] Config.Rootless (--rootless): false
D1120 10:07:10.190911  336244 config.go:462] Config.AlsoLogToStderr (--alsologtostderr): false
D1120 10:07:10.190913  336244 config.go:462] Config.ReferenceLeak (--ref-leak-mode): disabled
D1120 10:07:10.190917  336244 config.go:462] Config.CPUNumFromQuota (--cpu-num-from-quota): false
D1120 10:07:10.190919  336244 config.go:462] Config.AllowFlagOverride (--allow-flag-override): false
D1120 10:07:10.190925  336244 config.go:462] Config.OCISeccomp (--oci-seccomp): false
D1120 10:07:10.190927  336244 config.go:462] Config.IgnoreCgroups (--ignore-cgroups): false
D1120 10:07:10.190930  336244 config.go:462] Config.SystemdCgroup (--systemd-cgroup): false
D1120 10:07:10.190932  336244 config.go:462] Config.PodInitConfig (--pod-init-config): (empty)
D1120 10:07:10.190935  336244 config.go:462] Config.BufferPooling (--buffer-pooling): true
D1120 10:07:10.190937  336244 config.go:462] Config.XDP (--EXPERIMENTAL-xdp): {0 }
D1120 10:07:10.190943  336244 config.go:462] Config.AFXDPUseNeedWakeup (--EXPERIMENTAL-xdp-need-wakeup): true
D1120 10:07:10.190945  336244 config.go:462] Config.FDLimit (--fdlimit): -1
D1120 10:07:10.190948  336244 config.go:462] Config.DCache (--dcache): -1
D1120 10:07:10.190951  336244 config.go:462] Config.IOUring (--iouring): false
D1120 10:07:10.190953  336244 config.go:462] Config.DirectFS (--directfs): true
D1120 10:07:10.190956  336244 config.go:462] Config.AppHugePages (--app-huge-pages): true
D1120 10:07:10.190958  336244 config.go:462] Config.NVProxy (--nvproxy): false
D1120 10:07:10.190961  336244 config.go:462] Config.NVProxyDocker (--nvproxy-docker): false
D1120 10:07:10.190963  336244 config.go:462] Config.NVProxyDriverVersion (--nvproxy-driver-version): (empty)
D1120 10:07:10.190966  336244 config.go:462] Config.NVProxyAllowedDriverCapabilities (--nvproxy-allowed-driver-capabilities): utility,compute
D1120 10:07:10.190968  336244 config.go:462] Config.TPUProxy (--tpuproxy): false
D1120 10:07:10.190971  336244 config.go:462] Config.TestOnlyAllowRunAsCurrentUserWithoutChroot (--TESTONLY-unsafe-nonroot): false
D1120 10:07:10.190974  336244 config.go:462] Config.TestOnlyTestNameEnv (--TESTONLY-test-name-env): (empty)
D1120 10:07:10.190976  336244 config.go:462] Config.TestOnlyAFSSyscallPanic (--TESTONLY-afs-syscall-panic): false
D1120 10:07:10.190979  336244 config.go:464] Config.explicitlySet: <map[string]struct {} Value> (unexported)
D1120 10:07:10.190984  336244 config.go:462] Config.ReproduceNAT (--reproduce-nat): false
D1120 10:07:10.190987  336244 config.go:462] Config.ReproduceNftables (--reproduce-nftables): false
D1120 10:07:10.190991  336244 config.go:462] Config.NetDisconnectOk (--net-disconnect-ok): true
D1120 10:07:10.190994  336244 config.go:462] Config.TestOnlyAutosaveImagePath (--TESTONLY-autosave-image-path): (empty)
D1120 10:07:10.190997  336244 config.go:462] Config.TestOnlyAutosaveResume (--TESTONLY-autosave-resume): false
D1120 10:07:10.190999  336244 config.go:462] Config.TestOnlySaveRestoreNetstack (--TESTONLY-save-restore-netstack): false
I1120 10:07:10.191002  336244 main.go:200] **************** gVisor ****************
D1120 10:07:10.191021  336244 state_file.go:76] Load container, rootDir: "/var/run/runsc", id: {SandboxID: ContainerID:88a346f8-98cf-4fcb-bebe-b14aa84d60dc}, opts: {Exact:false SkipCheck:false TryLock:false RootContainer:false}
D1120 10:07:10.191897  336244 sandbox.go:1943] ContainerRuntimeState, sandbox: "88a346f8-98cf-4fcb-bebe-b14aa84d60dc", cid: "88a346f8-98cf-4fcb-bebe-b14aa84d60dc"
D1120 10:07:10.191913  336244 sandbox.go:734] Connecting to sandbox "88a346f8-98cf-4fcb-bebe-b14aa84d60dc"
D1120 10:07:10.191972  336244 urpc.go:571] urpc: successfully marshalled 96 bytes.
D1120 10:07:10.192217  336244 urpc.go:614] urpc: unmarshal success.
D1120 10:07:10.192235  336244 sandbox.go:1948] ContainerRuntimeState, sandbox: "88a346f8-98cf-4fcb-bebe-b14aa84d60dc", cid: "88a346f8-98cf-4fcb-bebe-b14aa84d60dc", state: 1
D1120 10:07:10.192306  336244 container.go:431] Start container, cid: 88a346f8-98cf-4fcb-bebe-b14aa84d60dc
D1120 10:07:10.192322  336244 hostsettings.go:186] Checking host settings
D1120 10:07:10.192327  336244 hostsettings.go:278] Checking host setting: /sys/kernel/mm/transparent_hugepage/shmem_enabled
D1120 10:07:10.192350  336244 hostsettings.go:278] Checking host setting: /proc/sys/vm/max_map_count
D1120 10:07:10.192363  336244 hostsettings.go:278] Checking host setting: /proc/sys/user/max_user_namespaces
D1120 10:07:10.192372  336244 hostsettings.go:278] Checking host setting: /proc/sys/kernel/unprivileged_userns_clone
D1120 10:07:10.192382  336244 hostsettings.go:278] Checking host setting: /proc/sys/kernel/unprivileged_userns_apparmor_policy
W1120 10:07:10.192387  336244 hostsettings.go:44] Host setting "/sys/kernel/mm/transparent_hugepage/shmem_enabled" (currently: "always within_size advise [never] deny force") is not optimal (turning on transparent hugepages support in shmem increases memory allocation performance); it is recommended to change it to "advise"
W1120 10:07:10.192399  336244 hostsettings.go:44] Host setting "/proc/sys/vm/max_map_count" (currently: "524288") is not optimal (increasing max_map_count decreases the likelihood of host VMA exhaustion); it is recommended to change it to "4194304"
D1120 10:07:10.192402  336244 sandbox.go:409] Start root sandbox "88a346f8-98cf-4fcb-bebe-b14aa84d60dc", PID: 336205
D1120 10:07:10.192407  336244 sandbox.go:734] Connecting to sandbox "88a346f8-98cf-4fcb-bebe-b14aa84d60dc"
I1120 10:07:10.192426  336244 network.go:56] Setting up network
I1120 10:07:10.192467  336244 namespace.go:108] Applying namespace network at path "/proc/336205/ns/net"
I1120 10:07:10.192584  336244 network.go:173] Skipping down interface: {Index:1 MTU:65536 Name:lo HardwareAddr: Flags:loopback}
D1120 10:07:10.192768  336244 network.go:305] Setting up network channels
D1120 10:07:10.192777  336244 network.go:308] Creating Channel 0
D1120 10:07:10.192829  336244 network.go:339] Setting up network, config: {FilePayload:{Files:[0xc000380e18 0xc000380e28]} LoopbackLinks:[] FDBasedLinks:[{Name:eth0 InterfaceIndex:0 MTU:1500 Addresses:[100.69.33.77/10] Routes:[{Destination:{IP:100.64.0.0 Mask:ffc00000} Gateway:<nil>}] GSOMaxSize:65536 GVisorGSOEnabled:false GVisorGRO:false TXChecksumOffload:false RXChecksumOffload:true LinkAddress:52:84:f2:eb:cc:d2 QDisc:fifo Neighbors:[] NumChannels:1 ProcessorsPerChannel:0}] XDPLinks:[] Defaultv4Gateway:{Route:{Destination:{IP:0.0.0.0 Mask:00000000000000000000ffff00000000} Gateway:100.64.0.1} Name:eth0} Defaultv6Gateway:{Route:{Destination:{IP:<nil> Mask:<nil>} Gateway:<nil>} Name:} PCAP:true LogPackets:true NATBlob:false DisconnectOk:true}
D1120 10:07:10.192994  336244 urpc.go:571] urpc: successfully marshalled 778 bytes.
D1120 10:07:10.193702  336244 urpc.go:614] urpc: unmarshal success.
I1120 10:07:10.193709  336244 namespace.go:129] Restoring namespace network
D1120 10:07:10.193725  336244 urpc.go:571] urpc: successfully marshalled 84 bytes.
D1120 10:07:10.198124  336244 urpc.go:614] urpc: unmarshal success.
D1120 10:07:10.198148  336244 container.go:1077] Save container, cid: 88a346f8-98cf-4fcb-bebe-b14aa84d60dc
D1120 10:07:10.199618  336244 state_file.go:76] Load container, rootDir: "/var/run/runsc", id: {SandboxID:88a346f8-98cf-4fcb-bebe-b14aa84d60dc ContainerID:88a346f8-98cf-4fcb-bebe-b14aa84d60dc}, opts: {Exact:true SkipCheck:true TryLock:false RootContainer:false}
I1120 10:07:10.199774  336244 main.go:221] Exiting with status: 0
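
The GSO-related settings are easy to miss in a log this long. In the output above, `--gso` and `--software-gso` are both reported as true, while the link configuration shows `GSOMaxSize:65536` with `GVisorGSOEnabled:false`. A minimal sketch for pulling those values out of a saved debug log follows; the function names are hypothetical and the regular expressions are written against the line shapes shown above, so other runsc versions may need adjustments.

```python
import re
from typing import Dict, List, Tuple

# Matches lines such as: "... Config.HostGSO (--gso): true"
FLAG_RE = re.compile(r"Config\.(\w+) \((--[\w-]+)\): (.*)$")
# Matches the per-link settings in the "Setting up network, config:" line.
LINK_RE = re.compile(r"GSOMaxSize:(\d+) GVisorGSOEnabled:(true|false)")


def parse_flags(log_text: str) -> Dict[str, str]:
    """Map each logged command-line flag to its value, as strings."""
    flags = {}
    for line in log_text.splitlines():
        m = FLAG_RE.search(line)
        if m:
            flags[m.group(2)] = m.group(3).strip()
    return flags


def gso_link_settings(log_text: str) -> List[Tuple[int, bool]]:
    """Return (GSOMaxSize, GVisorGSOEnabled) for every link in the log."""
    return [(int(size), enabled == "true")
            for size, enabled in LINK_RE.findall(log_text)]
```

Comparing `parse_flags(text)["--gso"]` between a failing run and a run with --gso=false is a quick way to confirm that the flag actually reached the sandbox.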

EtiennePerot commented 17 hours ago

cc @kevinGC

GerardGarcia commented 7 hours ago

This is a gVisor dump with --gso=false: Image

GerardGarcia commented 5 hours ago

The gVisor sandbox appears to hang when the error occurs. When executing curl with a large payload, it blocks and it is impossible to execute any other command that uses the network (commands that do not access the network work fine). I attach a dump of the goroutines while blocked.

runsc version release-20241118.0-15-gb15656de596e
spec: 1.1.0-rc.1

dlv.log