k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Iptables not properly set when using dual stack with ipv6 #7211

Closed: HOSTED-POWER closed this issue 1 year ago

HOSTED-POWER commented 1 year ago

Environmental Info: K3s Version: v1.25.8+k3s1

Node(s) CPU architecture, OS, and Version: 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

Cluster Configuration: 1 server

Describe the bug: When using ConfigServer Firewall (csf) with predefined iptables rules, we never had any issues; K3s properly creates all firewall rules (using ipv4 only). Now, with ipv6 activated, we get timeouts.

For example: [WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout

But this never recovers...

Steps To Reproduce: Install k3s with these arguments:

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="{v1.25.8+k3s1}" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="server --disable=traefik --cluster-cidr=10.42.0.0/16,fc00:a0::/64 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --flannel-ipv6-masq" sh -

And you get the non-working situation when csf (ConfigServer Firewall) is enabled.

Install like this, and there are 0 issues in the same environment:

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="{v1.25.8+k3s1}" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="server --disable=traefik" sh -

Another workaround: first add these iptables rules:

iptables -I INPUT -d 10.43.0.0/16 -j ACCEPT
iptables -I OUTPUT -d 10.43.0.0/16 -j ACCEPT
iptables -I INPUT -d 10.42.0.0/16 -j ACCEPT
iptables -I OUTPUT -d 10.42.0.0/16 -j ACCEPT

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION="{v1.25.8+k3s1}" K3S_KUBECONFIG_MODE="644" INSTALL_K3S_EXEC="server --disable=traefik --cluster-cidr=10.42.0.0/16,fc00:a0::/64 --service-cidr=10.43.0.0/16,2001:cafe:42:1::/112 --flannel-ipv6-masq" sh -

Also leads to a working installation ...
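
Presumably the ip6tables counterparts would be needed for the IPv6 side as well; the rules below simply mirror the IPv4 ones for the CIDRs used in the install command above (an assumption, not part of the original report):

ip6tables -I INPUT -d 2001:cafe:42:1::/112 -j ACCEPT
ip6tables -I OUTPUT -d 2001:cafe:42:1::/112 -j ACCEPT
ip6tables -I INPUT -d fc00:a0::/64 -j ACCEPT
ip6tables -I OUTPUT -d fc00:a0::/64 -j ACCEPT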

Expected behavior: A properly working k3s, also with ipv6

Actual behavior: Not working; coredns and the metrics-server service keep crashing and go into CrashLoopBackOff. Also, a chain that is normally empty is being populated:

Chain KUBE-SERVICES (2 references)
target     prot opt source               destination
REJECT     tcp  --  anywhere             10.43.0.10           /* kube-system/kube-dns:dns-tcp has no endpoints */ tcp dpt:domain reject-with icmp-port-unreachable
REJECT     tcp  --  anywhere             10.43.0.10           /* kube-system/kube-dns:metrics has no endpoints */ tcp dpt:9153 reject-with icmp-port-unreachable
REJECT     udp  --  anywhere             10.43.0.10           /* kube-system/kube-dns:dns has no endpoints */ udp dpt:domain reject-with icmp-port-unreachable
REJECT     tcp  --  anywhere             10.43.92.92          /* kube-system/metrics-server:https has no endpoints */ tcp dpt:https reject-with icmp-port-unreachable
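
Those REJECT rules are added by kube-proxy for Services that have no ready endpoints, which matches the crashing coredns and metrics-server pods. A quick way to confirm this (a diagnostic suggestion, not from the original report):

kubectl -n kube-system get endpoints kube-dns metrics-server   # empty ENDPOINTS columns match the REJECT rules above
kubectl -n kube-system get pods                                # shows the CrashLoopBackOff pods behind those services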

Additional context / logs:

[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout
time="2023-04-04T20:16:21Z" level=fatal msg="Error starting daemon: Cannot start Provisioner: failed to get Kubernetes server version: Get \"https://10.43.0.1:443/version?timeout=32s\": dial tcp 10.43.0.1:443: i/o timeout"
[WARNING] plugin/kubernetes: Kubernetes API connection failure: Get "https://10.43.0.1:443/version": dial tcp 10.43.0.1:443: i/o timeout

Warning Unhealthy 3m14s (x2 over 3m17s) kubelet Readiness probe failed: Get "https://10.42.0.3:10250/readyz": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 2m49s (x16 over 3m16s) kubelet Readiness probe failed: Get "https://10.42.0.3:10250/readyz": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
Warning Unhealthy 2m48s kubelet Readiness probe failed: Get "https://10.42.0.3:10250/readyz": dial tcp 10.42.0.3:10250: connect: connection refused

Some logs:

daemon.log not-working-kube.txt working-kube.txt

brandond commented 1 year ago

When using ConfigServer Firewall (csf) with predefined iptables rules, we never had any issues; K3s properly creates all firewall rules (using ipv4 only). Now, with ipv6 activated, we get timeouts.

Please take a look at the thread over at https://github.com/k3s-io/k3s/issues/7203#issuecomment-1495345235 - can you confirm whether or not you have a default-drop or default-deny rule at the end of your INPUT chain?

This workstation, because it does have some firewalling enabled (corporate policies), has the INPUT chain configured with a default policy of DROP. It has rules to accept local traffic from the "normal" interfaces, but cni0 is not covered by those rules.
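
A quick way to check whether a node is in that situation (a generic diagnostic, not quoted from the thread):

iptables -L INPUT -n --line-numbers | head -3   # "Chain INPUT (policy DROP)" indicates a default-deny policy
iptables -S INPUT | tail -3                     # look for a trailing -j DROP or -j REJECT catch-all rule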

brandond commented 1 year ago

cc @rbrtbnfgl since I believe this is related to the kube-router change we're discussing in that other thread.

rbrtbnfgl commented 1 year ago

Yes, but I don't know whether we should switch back to ACCEPT, even if only for the INPUT chain, when a node has a firewall that drops all traffic on it. That was upstream's reason for keeping ACCEPT.

brandond commented 1 year ago

It does feel like the kube-router default ACCEPT rule has been covering up problems for a lot of folks who SHOULD have opened up their firewall rules for K3s. Now that we just RETURN, they are running into problems because they didn't properly configure their iptables rules for K3s.

On the one hand I want to say it's working as designed (users need to properly configure their host iptables rules if they are blocking traffic); on the other, it is a breaking change for users who are upgrading, and configurations that previously worked suddenly do not.

I wonder if there is a way to fix the timeout issue in https://github.com/k3s-io/k3s/issues/6691 without also breaking clusters when users upgrade on a node that doesn't have properly configured user-managed iptables rules.

HOSTED-POWER commented 1 year ago

I lost more than a day trying to fix this. What kind of iptables "preparation" is needed? I can add the rules if I know which ones :)

HOSTED-POWER commented 1 year ago

It's just bad luck that we started with this ipv6 implementation right when 1.25.8 replaced 1.25.7; indeed, I have now installed v1.25.7 and have 0 issues. It's curious that this happens with a minor version update; I never expected that.

This is really a very breaking change as far as I can see :(

rbrtbnfgl commented 1 year ago

If you are using a firewall on the node, it's documented in the docs: https://docs.k3s.io/advanced#additional-os-preparations. It says to add the pod and service CIDRs to the trusted zone.
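
For reference, with firewalld and the CIDRs from this issue that would look roughly like this (a sketch; csf users would add the equivalent allow rules in csf's configuration instead):

firewall-cmd --permanent --zone=trusted --add-source=10.42.0.0/16          # pod CIDR (IPv4)
firewall-cmd --permanent --zone=trusted --add-source=10.43.0.0/16          # service CIDR (IPv4)
firewall-cmd --permanent --zone=trusted --add-source=fc00:a0::/64          # pod CIDR (IPv6)
firewall-cmd --permanent --zone=trusted --add-source=2001:cafe:42:1::/112  # service CIDR (IPv6)
firewall-cmd --reload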

brandond commented 1 year ago

While I appreciate that the current behavior is probably correct from a security perspective, I am very concerned that it is also a breaking change for many users who were relying on the old behavior for proper functioning of their cluster.

@rbrtbnfgl would it be possible to put the allow/return behavior behind a CLI flag that defaults to the old ALLOW behavior?

cc @cwayne18 @caroline-suse-rancher

rbrtbnfgl commented 1 year ago

It does feel like the kube-router default ACCEPT rule has been covering up problems for a lot of folks who SHOULD have opened up their firewall rules for K3s. Now that we just RETURN, they are running into problems because they didn't properly configure their iptables rules for K3s.

On the one hand I want to say it's working as designed (users need to properly configure their host iptables rules if they are blocking traffic); on the other, it is a breaking change for users who are upgrading, and configurations that previously worked suddenly do not.

I wonder if there is a way to fix the timeout issue in #6691 without also breaking clusters when users upgrade on a node that doesn't have properly configured user-managed iptables rules.

The issue is related to the iptables rules added by kube-router at the beginning of the chain. Packets that need to be accepted are correctly marked, but the ACCEPT rule is executed before the other rules in the chain. If the ACCEPT rule that matches the mark is appended at the end of the chain instead, the packets are still accepted by the kube-router rule, but only after they have been checked against every other rule in that chain.
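
To illustrate the ordering point (the mark value below is only a placeholder, not necessarily the one kube-router uses): iptables evaluates a chain top to bottom, and the first matching terminating rule wins.

# Inserted at position 1: marked packets are accepted immediately, and any
# user-managed DROP/REJECT rules further down are never consulted.
iptables -I INPUT 1 -m mark --mark 0x20000/0x20000 -j ACCEPT
# Appended at the end: the user's own rules are evaluated first, and this
# ACCEPT only applies to packets that nothing above has already rejected.
iptables -A INPUT -m mark --mark 0x20000/0x20000 -j ACCEPT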

HOSTED-POWER commented 1 year ago

While I appreciate that the current behavior is probably correct from a security perspective, I am very concerned that it is also a breaking change for many users who were relying on the old behavior for proper functioning of their cluster.

@rbrtbnfgl would it be possible to put the allow/return behavior behind a CLI flag that defaults to the old ALLOW behavior?

cc @cwayne18 @caroline-suse-rancher

To be honest, we were very happy with the previous implementation; it made things very easy and it always worked. I don't see much advantage in changing this, since the traffic needs to be allowed anyway. Why not rely on k3s to do this for us?

rbrtbnfgl commented 1 year ago

I can easily change kube-router to add the ACCEPT rules at the end of the chain, to keep both the previous behaviour and the fix for #6691.

brandond commented 1 year ago

@rbrtbnfgl do you think you can get that in for the next release? If so I believe that would probably save us a lot of additional issues.

HOSTED-POWER commented 1 year ago

Wow, I can't wait to test the fix. Is there an easy way to do this, or will it be released soon?

PS: Do we really need to uninstall k3s completely if we want to enable ipv6 on an already installed k3s? That seems like a lot of hassle, but it seems to be what the documentation says?

rbrtbnfgl commented 1 year ago

It's enough to run k3s-killall.sh and then start K3s again.
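
In practice that is something along these lines on a single server node (a sketch; paths assume a default install):

/usr/local/bin/k3s-killall.sh   # stops k3s and its pods, and flushes the iptables rules it created
systemctl start k3s             # the service recreates its rules on startup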

brandond commented 1 year ago

Wow, I can't wait to test the fix. Is there an easy way to do this, or will it be released soon?

See https://github.com/k3s-io/k3s/issues/7203#issuecomment-1499797437. As @rbrtbnfgl said, you will need to use k3s-killall.sh to clear the iptables rules before starting the new version.

PS: Do we really need to uninstall k3s completely if we want to enable ipv6 on an already installed k3s? That seems like a lot of hassle, but it seems to be what the documentation says?

If you want a dual-stack cluster, then yes: you should configure the dual-stack CIDRs when starting the server for the first time.
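
For example, this can be set via /etc/rancher/k3s/config.yaml before the server's first start (a sketch; the CIDRs are just the ones used earlier in this issue, and config.yaml is an alternative to passing the same flags on the command line):

cat > /etc/rancher/k3s/config.yaml <<'EOF'
cluster-cidr: "10.42.0.0/16,fc00:a0::/64"
service-cidr: "10.43.0.0/16,2001:cafe:42:1::/112"
flannel-ipv6-masq: true
EOF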

HOSTED-POWER commented 1 year ago

OK, that's confusing @brandond. Can we get away with just the killall script? What about single-node installs?

The fix doesn't seem to work for our use case, but I commented in the other thread.

HOSTED-POWER commented 1 year ago

Hi Brandond, we have a bunch of servers to upgrade with ipv6 support. Is the killall approach usable or not? :)

Is there any way/command to show the current k3s install options? This could come in handy when running upgrades as well.

For the rest, I tested the 1.25 rc1 and it seems resolved so far!
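
(For reference: on a default systemd-based install, the install script writes these options into the systemd unit, so they can usually be read back from there; this is a general note, not an official k3s command.)

grep -A3 'ExecStart' /etc/systemd/system/k3s.service   # server flags passed via INSTALL_K3S_EXEC
cat /etc/systemd/system/k3s.service.env                # K3S_* environment variables set at install time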

brandond commented 1 year ago

Closing as duplicate of https://github.com/k3s-io/k3s/issues/7203

proligde commented 1 year ago

@rbrtbnfgl I stumbled across this issue after my (and my colleagues') local k3s-backed dev environments stopped working after upgrading to 1.26.4.

My scenario is that we use k3s as a local development environment and route all our local domains to 127.0.0.1. This has worked for years, but all of a sudden ports 80 and 443, served through the svclb, stopped responding on 127.0.0.1 while still working on LAN addresses like 192.168....

After pinning the problem down to the k3s version used, I could confirm it still worked on k3s v1.25.6, but no longer on 1.25.9, 1.26.4 and 1.27.1, which led me to this ticket and the PR mentioned below.

To be honest, I don't understand how PR https://github.com/k3s-io/k3s/pull/7218 could produce that behavior. On the other hand, my iptables and containerd routing knowledge is very limited. So I'm wondering: is this just a red herring, or did I misconfigure something here?

thanks so much in advance - Max

rbrtbnfgl commented 1 year ago

Hi @proligde, could you open an issue with your setup config?