rancher-sandbox / rancher-desktop

Container Management and Kubernetes on the Desktop
https://rancherdesktop.io
Apache License 2.0

CI Build Test: Ran into FATA[0005] subnet 10.4.0.0/24 overlaps with other one on this address space #2935

Open gunamata opened 2 years ago

gunamata commented 2 years ago

Actual Behavior

Ran into the error below after running a container.

FATA[0005] subnet 10.4.0.0/24 overlaps with other one on this address space

I observed this behavior with the CI build: https://github.com/rancher-sandbox/rancher-desktop/actions/runs/3071306322

Steps to Reproduce

Result

Ran into the error below after running a container.

FATA[0005] subnet 10.4.0.0/24 overlaps with other one on this address space

Expected Behavior

The container should run without errors.

Additional Information

No response

Rancher Desktop Version

https://github.com/rancher-sandbox/rancher-desktop/actions/runs/3071306322

Rancher Desktop K8s Version

1.21.4

Which container engine are you using?

containerd (nerdctl)

What operating system are you using?

Windows

Operating System / Build Version

Windows 10 Enterprise

What CPU architecture are you using?

x64

Linux only: what package format did you use to install Rancher Desktop?

No response

Windows User Only

No response

Nino-K commented 2 years ago

The problem here is that the underlying container engine (CNI) checks any newly created network route against the existing routes on the system. If a route rule with an IP address from a conflicting subnet exists in iptables, it will yield this error. The conflicting routes could come either from the host network (bridge mode) or, in this case, from the Kube network. As a long-term solution, we could detect conflicting addresses and adjust the network pools made available to the container engine accordingly. As a short-term solution, we could document how to manually change the network pool address.
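
A rough way to check for such a conflict from inside the rancher-desktop distro (an illustrative sketch only; the grep pattern assumes the 10.4.0.0/24 subnet reported above):

# list existing routes and interface addresses
ip route show
ip addr show
# show NAT rules installed by CNI; look for entries overlapping 10.4.0.0/24
iptables -t nat -S | grep 10.4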

jandubois commented 2 years ago

I have not been able to repro this because #2934 is blocking me from getting to a working system.

jandubois commented 2 years ago

Doing a Factory Reset allowed me to go past #2934, but I still cannot repro this.

I got an error once:

e:\home\jan>nerdctl run -d -p 85:80 --restart=always nginx
FATA[0002] OCI runtime start failed: cannot start a container that has stopped: unknown

But that may have been because containerd was still starting up.

Afterwards I could run the command repeatedly without getting any error. I'm somewhat surprised though that nerdctl didn't tell me that the port was already in use:

e:\home\jan>nerdctl ps -a
CONTAINER ID    IMAGE                             COMMAND                   CREATED               STATUS    PORTS                 NAMES
0bb0a5a0de03    docker.io/library/nginx:latest    "/docker-entrypoint.…"    8 minutes ago         Up        0.0.0.0:85->80/tcp    nginx-0bb0a
0fb32c58e483    docker.io/library/nginx:latest    "/docker-entrypoint.…"    13 seconds ago        Up        0.0.0.0:85->80/tcp    nginx-0fb32
3d66a24f31ce    docker.io/library/nginx:latest    "/docker-entrypoint.…"    10 seconds ago        Up        0.0.0.0:85->80/tcp    nginx-3d66a
5449641543de    docker.io/library/nginx:latest    "/docker-entrypoint.…"    About a minute ago    Up        0.0.0.0:85->80/tcp    nginx-54496
bc5eb13c4ab8    docker.io/library/nginx:latest    "/docker-entrypoint.…"    3 minutes ago         Up        0.0.0.0:85->80/tcp    nginx-bc5eb
d33650d61a92    docker.io/library/nginx:latest    "/docker-entrypoint.…"    7 seconds ago         Up        0.0.0.0:85->80/tcp    nginx-d3365

FWIW, 10.4.0.0/24 is the network created by nerdctl itself:

16: nerdctl0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether b2:76:c0:57:f0:ea brd ff:ff:ff:ff:ff:ff
    inet 10.4.0.1/24 brd 10.4.0.255 scope global nerdctl0

So any reported conflict would be between the network already set up by a previous container start and a new one being set up at the same time.
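
To see which subnet the default nerdctl network actually got, something like the following should work inside the distro (untested against this particular build):

# list nerdctl's networks and inspect the default "bridge" network
nerdctl network ls
nerdctl network inspect bridge
# the corresponding Linux bridge interface
ip addr show nerdctl0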

jandubois commented 2 years ago

I forgot the "Reset Kubernetes with Images" step. After doing that, I get the error too:

e:\home\jan>nerdctl run -d -p 85:80 --restart=always nginx
docker.io/library/nginx:latest:                                                   resolved       |++++++++++++++++++++++++++++++++++++++|
index-sha256:0b970013351304af46f322da1263516b188318682b2ab1091862497591189ff1:    done           |++++++++++++++++++++++++++++++++++++++|
manifest-sha256:79c77eb7ca32f9a117ef91bc6ac486014e0d0e75f2f06683ba24dc298f9f4dd4: done           |++++++++++++++++++++++++++++++++++++++|
config-sha256:2d389e545974d4a93ebdef09b650753a55f72d1ab4518d17a30c0e1b3e297444:   done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:600c24b8ba3900f029e02f62ad9d14a04880ffdf7b8c57bfc74d569477002d67:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:31b3f1ad4ce1f369084d0f959813c51df0ca17d9877d5ee88c2db6ff88341430:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:fd42b079d0f818ce0687ee4290715b1b4843a1d5e6ebe7e3144c55ed11a215ca:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:30585fbbebc6bc3f81cb80830fe83b04613cda93ea449bb3465a08bdec8e2e43:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:18f4ffdd25f46fa28f496efb7949b137549b35cb441fb671c1f7fa4e081fd925:    done           |++++++++++++++++++++++++++++++++++++++|
layer-sha256:9dc932c8fba266219fd16728c9e3f632296d043407e77d6af626c5119f021b42:    done           |++++++++++++++++++++++++++++++++++++++|
elapsed: 15.7s                                                                    total:  30.0 M (1.9 MiB/s)
FATA[0017] subnet 10.4.0.0/24 overlaps with other one on this address space

gunamata commented 2 years ago

I could repro this on 1.5.1 (the latest release at this time) too. Here are the steps (the same as in the initial issue description, except that I captured some additional info about the Kubernetes versions I used):

  1. Reset Kubernetes to start fresh, I am on Kubernetes version v1.25.0
  2. Run a container nerdctl run -d -p 85:80 --restart=always nginx
  3. Downgrade to v1.20.15
  4. Reset Kubernetes with Images
  5. Run a container nerdctl run -d -p 85:80 --restart=always nginx

jandubois commented 2 years ago

Out of curiosity I tried this on macOS as well, and I couldn't repro it there.

I didn't really expect it anyway; earlier discussion with @mook-as produced the theory that the problem arises because "deleting" the VM on WSL does not really restart WSL, and since networking is shared between distros, it is possible that the old network definitions were not cleaned up properly.

jandubois commented 2 years ago

This has nothing to do with k8s or downgrading; I can repro it with these simplified steps:

  1. Fresh install of 1.5.1 (with k8s disabled)
  2. Run nerdctl run -d -p 85:80 --restart=always nginx
  3. Reset Kubernetes with Images [^1]
  4. Run nerdctl run -d -p 85:80 --restart=always nginx

So it seems indeed like the nerdctl0 network is lingering even though the rancher-desktop distro got deleted and recreated.
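
One possible way to test the lingering-network theory (an untested sketch; it assumes the distro keeps the name rancher-desktop and has iproute2 available) is to check, right after the reset and before starting any container, whether the old bridge definition is still visible from a Windows prompt:

wsl -d rancher-desktop -- ip addr show nerdctl0
wsl -d rancher-desktop -- ip route show

If nerdctl0 still shows 10.4.0.1/24 at that point, the stale definition survived the reset.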

Since it is not a regression, I think this could be moved to the "Later" milestone.

[^1]: It doesn't really make sense to call this "Reset Kubernetes" while Kubernetes is disabled.

gunamata commented 2 years ago

Doing a Factory Reset or restarting the machine resolved this issue for me on Windows 10 Enterprise. Just sharing in case it helps with the investigation of the problem.

jandubois commented 2 years ago

Doing a Factory Reset or restarting the machine resolved this issue for me on Windows 10 Enterprise.

I would think that anything that shuts down the WSL VM (and not just the individual distro) would fix it because I don't see how a network definition would survive the restart.

So I think wsl --shutdown would fix the problem, but it is rather heavy-handed, as it will stop all other distros as well. At the very least we would need an extra warning/confirmation from the user.
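
For reference, the heavy-handed workaround would be roughly the following, run from a Windows prompt after quitting Rancher Desktop (note that it stops every WSL distro on the machine):

wsl --shutdown

After restarting Rancher Desktop, the nerdctl0 network should then be recreated from scratch.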

gaktive commented 2 years ago

@gunamata to provide material to update the FAQ around this. We should also test whether a WSL shutdown resolves it.

Nino-K commented 2 years ago

So I think wsl --shutdown would fix the problem, but it is rather heavy-handed, as it will stop all other distros as well. At the very least we would need an extra warning/confirmation from the user.

If we are taking the short-term approach, we should be able to change the default network pool available to containerd (via /etc/cni/net.d; edit the config if it exists, otherwise create one) instead of shutting down WSL. This would be very similar to Docker's default-address-pools, e.g.:

"default-address-pools": [
       {
           "base": "10.17.0.1/16",
            "size": 16
        }
]

This example might be useful: https://github.com/containerd/containerd/blob/main/script/setup/install-cni
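
A rough sketch of what that could look like for containerd/nerdctl (untested with Rancher Desktop; the file name, CNI version, and 10.17.0.0/16 subnet are illustrative assumptions) would be to drop a bridge conflist into /etc/cni/net.d that pins the default network to a non-conflicting range:

# as root inside the rancher-desktop distro; file name and subnet are placeholders
cat > /etc/cni/net.d/10-mybridge.conflist <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "bridge",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "nerdctl0",
      "isGateway": true,
      "ipMasq": true,
      "ipam": {
        "type": "host-local",
        "ranges": [[{ "subnet": "10.17.0.0/16" }]]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
EOF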

Nino-K commented 2 years ago

@gunamata this (custom networks) should be sufficient for documentation purposes. Although I have not tested it with our version of nerdctl.
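
An untested sketch of that custom-network approach, with the rd-net name and the 10.17.0.0/24 subnet picked purely for illustration:

# create a user-defined bridge network on a subnet that does not clash with existing routes
nerdctl network create --subnet 10.17.0.0/24 rd-net
# run the container on that network instead of the default bridge
nerdctl run -d -p 85:80 --restart=always --network rd-net nginx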

jandubois commented 2 years ago

Although I have not tested it with our version of nerdctl.

Please test before adding to docs! We should be sure it actually works. 😺

gaktive commented 1 year ago

Based on a comment in #3365, it looks like nerdctl introduced this in https://github.com/containerd/nerdctl/pull/1245.