rancher-sandbox / rancher-desktop

Container Management and Kubernetes on the Desktop
https://rancherdesktop.io
Apache License 2.0

WSL: Need to force `net.ipv4.ip_forward` #5341

Closed: mook-as closed this issue 2 months ago

mook-as commented 1 year ago

Actual Behavior

Sometimes net.ipv4.ip_forward defaults to 0, causing traefik to fail to start up. It's unclear what circumstances lead to this.

Steps to Reproduce

Result

The traefik lb pod goes into CrashLoopBackOff; its logs indicate that /proc/sys/net/ipv4/ip_forward was set to 0 (instead of the expected 1).

Expected Behavior

Rancher Desktop should do the necessary set up so that the WSL VM is in a state that can run our workloads.

Additional Information

Manually running sysctl -w net.ipv4.ip_forward=1 (in a different WSL distribution) and then restarting Rancher Desktop appears to fix the issue.
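For anyone who wants to script the check and the workaround, here is a minimal sketch. It assumes the standard procfs path for this sysctl and that it is run as root inside the affected environment; the function names are hypothetical and this is not code from Rancher Desktop.

```python
from pathlib import Path

# Standard procfs location of net.ipv4.ip_forward.
IP_FORWARD_PATH = Path("/proc/sys/net/ipv4/ip_forward")


def ip_forward_enabled(path: Path = IP_FORWARD_PATH) -> bool:
    """Return True when the procfs entry currently contains 1."""
    return path.read_text().strip() == "1"


def ensure_ip_forward(path: Path = IP_FORWARD_PATH) -> bool:
    """Enable forwarding if it is off; return True if a write was needed."""
    if ip_forward_enabled(path):
        return False
    # Same effect as `sysctl -w net.ipv4.ip_forward=1`; requires root.
    path.write_text("1\n")
    return True
```

Calling `ensure_ip_forward()` with no arguments touches the real sysctl; passing a different path makes it easy to try out against a scratch file first.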

Rancher Desktop Version

1.9.1-512-g48956782

Rancher Desktop K8s Version

1.22.7

Which container engine are you using?

containerd (nerdctl)

What operating system are you using?

Windows

Operating System / Build Version

Windows 10 Pro 22H2 (Build 19045.3324)

What CPU architecture are you using?

x64

Linux only: what package format did you use to install Rancher Desktop?

None

Windows User Only

N/A

jandubois commented 3 months ago

I've been able to reproduce this issue on Windows 11, while @Nino-K did not observe it on Windows 10. We made sure that we were using the same WSL2 version and the same kernel version.

net.ipv4.ip_forward is set to 1 inside the WSL distro and in the regular traefik pod, but is 0 in the svclb pod (at least during startup; the container is stopped right away, so there is no chance to inspect it manually).

> Manually running sysctl -w net.ipv4.ip_forward=1 (in a different WSL distribution) and then restarting Rancher Desktop appears to fix the issue.

Given that this is already enabled in the rancher-desktop distro, I'm surprised this makes a difference. Maybe it needs to be enabled in the default namespace?

k3s 1.25.3 added code that forces net.ipv4.ip_forward=1 in svclb. It has been backported to the corresponding patch releases of 1.23 and 1.24, but not to any earlier versions (part of https://github.com/k3s-io/k3s/pull/6181).

So running Kubernetes 1.25.3+ is a workaround to avoid this problem.
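As a rough illustration of the version cutoff described above, the following sketch classifies a Kubernetes version string. Only the 1.25.3 threshold comes from this thread; the exact 1.23 and 1.24 patch releases that carry the backport are not stated here, so they are left as a caller-supplied mapping, and the function names are hypothetical.

```python
import re
from typing import Mapping, Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Parse strings such as '1.22.7' or 'v1.25.3+k3s1' into (major, minor, patch)."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return major, minor, patch


def has_svclb_forwarding_fix(
    version: str,
    backport_min_patch: Optional[Mapping[Tuple[int, int], int]] = None,
) -> Optional[bool]:
    """Return True/False when the answer is known, or None when it depends on
    a backport patch level that the caller did not supply."""
    major, minor, patch = parse_version(version)
    if (major, minor, patch) >= (1, 25, 3):
        return True
    if (major, minor) in ((1, 23), (1, 24)):
        # Assumption: the caller looks up the first patched release in the
        # k3s release notes; this sketch does not hard-code it.
        floor = (backport_min_patch or {}).get((major, minor))
        return None if floor is None else patch >= floor
    return False
```

For the version in the original report, `has_svclb_forwarding_fix("1.22.7")` returns `False`, which matches the failure seen there. A diagnostic or a test-skip condition could be built on the same check.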

We should still try to find a workaround, e.g. by enabling this option before creating our own separate namespace. Or maybe both before and after?

If this doesn't help, then we should create a diagnostic instead to tell the user that Traefik isn't working, and recommend upgrading to a non-obsolete version of Kubernetes. See also #6342.

In that case we also need to update all BATS tests using Traefik to either require a newer Kubernetes version, or skip the test if the requested version is too old.

We may also want to increase the default version used for testing to something more recent.

jandubois commented 2 months ago

Fixed by #7110