k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Cannot access ports on loopback address after upgrading from `v1.27.6+k3s1` to `v1.27.7+k3s1` #8793

Closed · brandond closed this issue 8 months ago

brandond commented 8 months ago

I am creating this as a discussion because I haven't seen anyone else report this issue.

When updating our clusters from v1.27.6+k3s1 to v1.27.7+k3s1, the embedded etcd doesn't seem to come up correctly. Even with debug: true, the logs don't show anything helpful, just that the etcd client cannot connect.

Please see the attached logs: k3s.1.27.7.debug.log

It doesn't matter which of the three server nodes I try to update first. The first node I try to update always refuses to come up.

Our config.yaml looks like this:

## COMMON CONFIGURATION

debug: true
token: 'topsecret'
data-dir: '/ssd/k3s'
node-ip: '172.28.180.138'

prefer-bundled-bin: true

flannel-iface: 'eth0'

kubelet-arg:
  - 'root-dir=/ssd/k3s/kubelet'
  - 'image-gc-high-threshold=70'
  - 'image-gc-low-threshold=50'

node-taint:
  - CriticalAddonsOnly=true:NoExecute

## SERVER SPECIFIC CONFIGURATION

cluster-cidr: '10.42.0.0/16'
service-cidr: '10.43.0.0/16'
flannel-backend: 'host-gw'
tls-san: [mngr1.mycompany.local, mngr2.mycompany.local, mngr3.mycompany.local]

disable-network-policy: true

etcd-expose-metrics: true

secrets-encryption: true

I wish I could describe the issue better, but the logs don't give me much to work with.

Edit: The logs created by v1.27.6+k3s1 look pretty much identical, except for the non-failing etcd-client.

Edit 2: v1.28.3+k3s1 fails for us, too. v1.26.10+k3s1 is working fine.

Originally posted by @ChristianCiach in https://github.com/k3s-io/k3s/discussions/8780

ChristianCiach commented 8 months ago

Some additional information from the original discussion:

Speculation: I think the issue happens because install.sh uses the host's iptables-save, which only outputs a subset of the rules that the bundled binary would return.

I plan to create another discussion for this tomorrow, because I would like to know why the install script even attempts to remove iptables rules (from a running k3s instance, no less!). The documentation also describes another way to upgrade k3s without the install script, by replacing the k3s binary directly. I guess (but haven't tested yet!) that upgrading without the install script should also avoid this issue.
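
A quick way to test that speculation is to diff what the host tools and the bundled tools report for the nat table. A rough sketch (the bundled path assumes the default data dir; with data-dir: '/ssd/k3s' it would live under /ssd/k3s/data/current/bin/aux instead):

# Compare the host's and the bundled iptables-save view of the nat table.
# Rules whose matches only survive in the bundled output are candidates for
# being mangled by a host save|restore cycle.
diff \
  <(iptables-save -t nat | grep -v '^[#:]' | sort) \
  <(/var/lib/rancher/k3s/data/current/bin/aux/iptables-save -t nat | grep -v '^[#:]' | sort)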

brandond commented 8 months ago

It appears that packets to the loopback address are being incorrectly masqueraded, which in turn causes them to be blocked by the KUBE-FIREWALL rule that prevents access to host loopback addresses from non-loopback sources.
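
The traces below come from nftables packet tracing (note the meta nftrace set 1 rules in the raw table). A minimal sketch of how such a trace can be reproduced, assuming iptables-nft is in use so the ip raw table and its OUTPUT/PREROUTING chains already exist:

# Flag packets to the port under test (2399 here) for tracing in the raw table,
# which is evaluated before any NAT, then watch the per-chain verdicts live.
nft add rule ip raw OUTPUT tcp dport 2399 meta nftrace set 1
nft add rule ip raw PREROUTING tcp dport 2399 meta nftrace set 1
nft monitor trace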

Trace of a connection to 127.0.0.1:2399 on a node running 1.27.6+k3s1:

trace id bd0052fb ip raw OUTPUT packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id bd0052fb ip raw OUTPUT rule meta l4proto tcp ip daddr 127.0.0.1 tcp dport 2399 counter packets 1 bytes 60 meta nftrace set 1 (verdict continue)
trace id bd0052fb ip raw OUTPUT verdict continue
trace id bd0052fb ip raw OUTPUT policy accept

trace id bd0052fb ip nat OUTPUT packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id bd0052fb ip nat OUTPUT rule  counter packets 910 bytes 55873 jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
trace id bd0052fb ip nat KUBE-SERVICES rule  fib daddr type local counter packets 132 bytes 6809 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id bd0052fb ip nat KUBE-NODEPORTS verdict continue
trace id bd0052fb ip nat KUBE-SERVICES verdict continue
trace id bd0052fb ip nat OUTPUT rule fib daddr type local counter packets 63 bytes 3780 jump CNI-HOSTPORT-DNAT (verdict jump CNI-HOSTPORT-DNAT)
trace id bd0052fb ip nat CNI-HOSTPORT-DNAT verdict continue
trace id bd0052fb ip nat OUTPUT verdict continue
trace id bd0052fb ip nat OUTPUT policy accept

trace id bd0052fb ip filter OUTPUT packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id bd0052fb ip filter OUTPUT rule  counter packets 31934 bytes 7098788 jump KUBE-ROUTER-OUTPUT (verdict jump KUBE-ROUTER-OUTPUT)
trace id bd0052fb ip filter KUBE-ROUTER-OUTPUT verdict continue
trace id bd0052fb ip filter OUTPUT rule ct state new  counter packets 30 bytes 1992 jump KUBE-PROXY-FIREWALL (verdict jump KUBE-PROXY-FIREWALL)
trace id bd0052fb ip filter KUBE-PROXY-FIREWALL verdict continue
trace id bd0052fb ip filter OUTPUT rule ct state new  counter packets 30 bytes 1992 jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
trace id bd0052fb ip filter KUBE-SERVICES verdict continue
trace id bd0052fb ip filter OUTPUT rule counter packets 29469 bytes 6855653 jump KUBE-FIREWALL (verdict jump KUBE-FIREWALL)
trace id bd0052fb ip filter KUBE-FIREWALL verdict continue
trace id bd0052fb ip filter OUTPUT verdict continue
trace id bd0052fb ip filter OUTPUT policy accept

trace id bd0052fb ip nat POSTROUTING packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id bd0052fb ip nat POSTROUTING rule  counter packets 787 bytes 47776 jump CNI-HOSTPORT-MASQ (verdict jump CNI-HOSTPORT-MASQ)
trace id bd0052fb ip nat CNI-HOSTPORT-MASQ verdict continue
trace id bd0052fb ip nat POSTROUTING rule  counter packets 914 bytes 56102 jump KUBE-POSTROUTING (verdict jump KUBE-POSTROUTING)
trace id bd0052fb ip nat KUBE-POSTROUTING verdict return
trace id bd0052fb ip nat POSTROUTING rule  counter packets 865 bytes 52899 jump FLANNEL-POSTRTG (verdict jump FLANNEL-POSTRTG)
trace id bd0052fb ip nat FLANNEL-POSTRTG verdict continue
trace id bd0052fb ip nat POSTROUTING verdict continue
trace id bd0052fb ip nat POSTROUTING policy accept

trace id 3b0c5f4b ip raw PREROUTING packet: iif "lo" @ll,0,112 2048 ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id 3b0c5f4b ip raw PREROUTING rule meta l4proto tcp ip daddr 127.0.0.1 tcp dport 2399 counter packets 2 bytes 120 meta nftrace set 1 (verdict continue)
trace id 3b0c5f4b ip raw PREROUTING verdict continue
trace id 3b0c5f4b ip raw PREROUTING policy accept

trace id 3b0c5f4b ip filter INPUT packet: iif "lo" @ll,0,112 2048 ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id 3b0c5f4b ip filter INPUT rule  counter packets 31609 bytes 7328989 jump KUBE-ROUTER-INPUT (verdict jump KUBE-ROUTER-INPUT)
trace id 3b0c5f4b ip filter KUBE-ROUTER-INPUT verdict continue
trace id 3b0c5f4b ip filter INPUT rule ct state new  counter packets 52 bytes 2528 jump KUBE-PROXY-FIREWALL (verdict jump KUBE-PROXY-FIREWALL)
trace id 3b0c5f4b ip filter KUBE-PROXY-FIREWALL verdict continue
trace id 3b0c5f4b ip filter INPUT rule  counter packets 29508 bytes 6851425 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id 3b0c5f4b ip filter KUBE-NODEPORTS verdict continue
trace id 3b0c5f4b ip filter INPUT rule ct state new  counter packets 52 bytes 2528 jump KUBE-EXTERNAL-SERVICES (verdict jump KUBE-EXTERNAL-SERVICES)
trace id 3b0c5f4b ip filter KUBE-EXTERNAL-SERVICES verdict continue
trace id 3b0c5f4b ip filter INPUT rule counter packets 29508 bytes 6851425 jump KUBE-FIREWALL (verdict jump KUBE-FIREWALL)
trace id 3b0c5f4b ip filter KUBE-FIREWALL verdict continue
trace id 3b0c5f4b ip filter INPUT verdict continue
trace id 3b0c5f4b ip filter INPUT policy accept

Trace of a connection to 127.0.0.1:2399 from a node that was upgraded from 1.27.6+k3s1 to 1.27.7+k3s1:

trace id f941b4ee ip raw OUTPUT packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id f941b4ee ip raw OUTPUT rule meta l4proto tcp ip daddr 127.0.0.1 tcp dport { 2379,2399} counter packets 84 bytes 5040 meta nftrace set 1 (verdict continue)
trace id f941b4ee ip raw OUTPUT verdict continue
trace id f941b4ee ip raw OUTPUT policy accept

trace id f941b4ee ip mangle OUTPUT verdict continue
trace id f941b4ee ip mangle OUTPUT policy accept

trace id f941b4ee ip nat OUTPUT packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id f941b4ee ip nat OUTPUT rule fib daddr type local counter packets 476 bytes 28560 jump CNI-HOSTPORT-DNAT (verdict jump CNI-HOSTPORT-DNAT)
trace id f941b4ee ip nat CNI-HOSTPORT-DNAT verdict continue
trace id f941b4ee ip nat OUTPUT verdict continue
trace id f941b4ee ip nat OUTPUT policy accept

trace id f941b4ee ip filter OUTPUT packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id f941b4ee ip filter OUTPUT rule  counter packets 20869 bytes 5222004 jump KUBE-ROUTER-OUTPUT (verdict jump KUBE-ROUTER-OUTPUT)
trace id f941b4ee ip filter KUBE-ROUTER-OUTPUT verdict continue
trace id f941b4ee ip filter OUTPUT rule ct state new  counter packets 2622 bytes 162617 jump KUBE-PROXY-FIREWALL (verdict jump KUBE-PROXY-FIREWALL)
trace id f941b4ee ip filter KUBE-PROXY-FIREWALL verdict continue
trace id f941b4ee ip filter OUTPUT rule ct state new  counter packets 2622 bytes 162617 jump KUBE-SERVICES (verdict jump KUBE-SERVICES)
trace id f941b4ee ip filter KUBE-SERVICES verdict continue
trace id f941b4ee ip filter OUTPUT rule counter packets 20781 bytes 5214054 jump KUBE-FIREWALL (verdict jump KUBE-FIREWALL)
trace id f941b4ee ip filter KUBE-FIREWALL verdict continue
trace id f941b4ee ip filter OUTPUT verdict continue
trace id f941b4ee ip filter OUTPUT policy accept

trace id f941b4ee ip mangle POSTROUTING verdict continue
trace id f941b4ee ip mangle POSTROUTING policy accept

trace id f941b4ee ip nat POSTROUTING packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id f941b4ee ip nat POSTROUTING rule  counter packets 1293 bytes 80403 jump CNI-HOSTPORT-MASQ (verdict jump CNI-HOSTPORT-MASQ)
trace id f941b4ee ip nat CNI-HOSTPORT-MASQ rule counter packets 1293 bytes 80403 masquerade  (verdict accept)

trace id 4eb0d1cf ip raw PREROUTING packet: iif "lo" @ll,0,112 2048 ip saddr 172.31.10.14 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id 4eb0d1cf ip raw PREROUTING rule meta l4proto tcp ip daddr 127.0.0.1 tcp dport { 2379,2399} counter packets 558 bytes 33480 meta nftrace set 1 (verdict continue)
trace id 4eb0d1cf ip raw PREROUTING verdict continue
trace id 4eb0d1cf ip raw PREROUTING policy accept

trace id 4eb0d1cf ip mangle PREROUTING verdict continue
trace id 4eb0d1cf ip mangle PREROUTING policy accept
trace id 4eb0d1cf ip mangle INPUT verdict continue
trace id 4eb0d1cf ip mangle INPUT policy accept

trace id 4eb0d1cf ip filter INPUT packet: iif "lo" @ll,0,112 2048 ip saddr 172.31.10.14 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id 4eb0d1cf ip filter INPUT rule  counter packets 22321 bytes 63209641 jump KUBE-ROUTER-INPUT (verdict jump KUBE-ROUTER-INPUT)
trace id 4eb0d1cf ip filter KUBE-ROUTER-INPUT verdict continue
trace id 4eb0d1cf ip filter INPUT rule ct state new  counter packets 3830 bytes 212938 jump KUBE-PROXY-FIREWALL (verdict jump KUBE-PROXY-FIREWALL)
trace id 4eb0d1cf ip filter KUBE-PROXY-FIREWALL verdict continue
trace id 4eb0d1cf ip filter INPUT rule  counter packets 22231 bytes 63203120 jump KUBE-NODEPORTS (verdict jump KUBE-NODEPORTS)
trace id 4eb0d1cf ip filter KUBE-NODEPORTS verdict continue
trace id 4eb0d1cf ip filter INPUT rule ct state new  counter packets 3830 bytes 212938 jump KUBE-EXTERNAL-SERVICES (verdict jump KUBE-EXTERNAL-SERVICES)
trace id 4eb0d1cf ip filter KUBE-EXTERNAL-SERVICES verdict continue
trace id 4eb0d1cf ip filter INPUT rule counter packets 22231 bytes 63203120 jump KUBE-FIREWALL (verdict jump KUBE-FIREWALL)
trace id 4eb0d1cf ip filter KUBE-FIREWALL rule ip saddr != 127.0.0.0/8 ip daddr 127.0.0.0/8  ct state != related,established counter packets 2323 bytes 139380 drop (verdict drop)
brandond commented 8 months ago

The key difference appears to be in the nat POSTROUTING chain, which jumps into CNI-HOSTPORT-MASQ. This is now matching the outbound packet and triggering the masquerade:

Before:

trace id bd0052fb ip nat POSTROUTING packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690
trace id bd0052fb ip nat POSTROUTING rule  counter packets 787 bytes 47776 jump CNI-HOSTPORT-MASQ (verdict jump CNI-HOSTPORT-MASQ)
trace id bd0052fb ip nat CNI-HOSTPORT-MASQ verdict continue
trace id bd0052fb ip nat POSTROUTING rule  counter packets 914 bytes 56102 jump KUBE-POSTROUTING (verdict jump KUBE-POSTROUTING)
trace id bd0052fb ip nat KUBE-POSTROUTING verdict return
trace id bd0052fb ip nat POSTROUTING rule  counter packets 865 bytes 52899 jump FLANNEL-POSTRTG (verdict jump FLANNEL-POSTRTG)
trace id bd0052fb ip nat FLANNEL-POSTRTG verdict continue
trace id bd0052fb ip nat POSTROUTING verdict continue
trace id bd0052fb ip nat POSTROUTING policy accept

# packet source is still 127.0.0.1
trace id 3b0c5f4b ip raw PREROUTING packet: iif "lo" @ll,0,112 2048 ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 46391 ip length 60 tcp sport 41168 tcp dport 2399 tcp flags == syn tcp window 43690

After:

trace id f941b4ee ip nat POSTROUTING packet: oif "lo" ip saddr 127.0.0.1 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690
trace id f941b4ee ip nat POSTROUTING rule  counter packets 1293 bytes 80403 jump CNI-HOSTPORT-MASQ (verdict jump CNI-HOSTPORT-MASQ)
trace id f941b4ee ip nat CNI-HOSTPORT-MASQ rule counter packets 1293 bytes 80403 masquerade  (verdict accept)

# note that the packet has now been masqueraded and has a source address of 172.31.10.14 instead of 127.0.0.1
trace id 4eb0d1cf ip raw PREROUTING packet: iif "lo" @ll,0,112 2048 ip saddr 172.31.10.14 ip daddr 127.0.0.1 ip dscp cs0 ip ecn not-ect ip ttl 64 ip id 53609 ip length 60 tcp sport 51048 tcp dport 2399 tcp flags == syn tcp window 43690

Notably, the CNI-HOSTPORT-MASQ rule no longer matches the mark, but instead matches all packets:

[root@ip-172-31-4-16 ~]# /var/lib/rancher/k3s/data/current/bin/aux/xtables-nft-multi iptables-nft-save 2>/dev/null | grep CNI-HOSTPORT-MASQ
:CNI-HOSTPORT-MASQ - [0:0]
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE

vs

[root@ip-172-31-10-14 ~]# /var/lib/rancher/k3s/data/current/bin/aux/xtables-nft-multi iptables-nft-save 2>/dev/null | grep CNI-HOSTPORT-MASQ
:CNI-HOSTPORT-MASQ - [0:0]
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A CNI-HOSTPORT-MASQ -j MASQUERADE
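
To see what is actually programmed in the kernel, independent of which iptables front-end is asked, the chain can also be listed with nft directly (a sketch; the exact rendering of the match depends on the iptables version that created the rule):

# On a healthy node the MASQUERADE rule should carry a mark match, rendered by
# nft as something like `meta mark & 0x2000 == 0x2000`; on the broken node it
# masquerades unconditionally.
nft list chain ip nat CNI-HOSTPORT-MASQ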
brandond commented 8 months ago

the CNI-HOSTPORT-MASQ rule comes from the portmap plugin, so I suspect this is related to

brandond commented 8 months ago

I can confirm that just flushing the chain allows K3s to start successfully. I don't even have to do anything else; just let it retry and it picks up after a minute.

iptables -t nat -F CNI-HOSTPORT-MASQ
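
If you don't want to wait for the CNI plugin to recreate the rule, a hypothetical alternative (not something suggested in this thread) is to re-add the mark-scoped rule with the bundled binary, mirroring the healthy output shown above:

# Recreate the mark-scoped MASQUERADE rule, i.e. the form shown for the
# healthy node, using the bundled iptables-nft front-end.
/var/lib/rancher/k3s/data/current/bin/aux/xtables-nft-multi iptables-nft \
  -t nat -A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE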
brandond commented 8 months ago

The issue appears to be that the host's iptables-save is buggy and does not properly output the mark match:

[root@ip-172-31-10-14 ~]# /var/lib/rancher/k3s/data/current/bin/aux/iptables-save | grep CNI-HOSTPORT-MASQ
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them
:CNI-HOSTPORT-MASQ - [0:0]
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE

[root@ip-172-31-10-14 ~]# iptables-save | grep CNI-HOSTPORT-MASQ
:CNI-HOSTPORT-MASQ - [0:0]
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A CNI-HOSTPORT-MASQ -j MASQUERADE
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them

Anything that runs the host's iptables-save/iptables-restore after k3s has started will break connectivity to localhost. This does not appear to be new; if I go back to 1.27.4, prior to the update of the CNI plugins, I see that using the host iptables-save will still break k3s when using the embedded etcd. All you have to do is:

systemctl stop k3s
iptables-save | iptables-restore
systemctl start k3s

Running k3s-killall.sh will wipe the CNI rules, which will allow K3s to start up again successfully - at least until the next time the rules are corrupted by the broken host tools.

If k3s is run without prefer-bundled-bin enabled, the mark rules are dumped properly by the host tools:

[root@ip-172-31-10-14 ~]# iptables-save | grep CNI-HOSTPORT-MASQ
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them
:CNI-HOSTPORT-MASQ - [0:0]
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE

This makes me suspect that this has been a problem with the bundled iptables binaries since we last bumped the buildroot version way back in https://github.com/k3s-io/k3s/pull/6400. That would mean that this issue has been present since 1.25.5+k3s1. Sure enough, if I install 1.25.4+k3s1 it works fine:

[root@ip-172-31-10-14 ~]# k3s --version
k3s version v1.25.4+k3s1 (0dc63334)
go version go1.19.3

[root@ip-172-31-10-14 ~]# grep prefer-bundled /etc/rancher/k3s/config.yaml
prefer-bundled-bin: true

[root@ip-172-31-10-14 ~]# iptables-save | grep CNI-HOSTPORT-MASQ
:CNI-HOSTPORT-MASQ - [0:0]
-A POSTROUTING -m comment --comment "CNI portfwd requiring masquerade" -j CNI-HOSTPORT-MASQ
-A CNI-HOSTPORT-MASQ -m mark --mark 0x2000/0x2000 -j MASQUERADE
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them

I'm not sure what might have changed recently to make this more of an issue; perhaps something has been added to these distros that calls iptables-save/iptables-restore on restart? However, since this is apparently not due to any recent changes in K3s, I think this is best handled via documentation noting that the host iptables save/restore tools MUST NOT be used alongside the bundled iptables bins on EL7 distros. Users are welcome to use the K3s bundled bins in preference to the host bins system-wide by running:

ln -sf /var/lib/rancher/k3s/data/current/bin/aux/xtables-nft-multi /sbin/xtables-nft-multi
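
A hypothetical extension of that, if the save/restore entry points should be covered too (the bundled aux directory's own symlinks suggest the multi-call binary recognizes these names when invoked via symlink; note this shadows the distro-provided commands, so treat it as a workaround rather than a supported setup):

# Point the usual host commands at the bundled multi-call binary as well.
for cmd in iptables iptables-save iptables-restore ip6tables ip6tables-save ip6tables-restore; do
  ln -sf /var/lib/rancher/k3s/data/current/bin/aux/xtables-nft-multi /sbin/$cmd
done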
ChristianCiach commented 8 months ago

However, since this is apparently not due to any recent changes in K3s, I think we should best handle it via documentation noting that the host iptables save/restore tools MUST NOT be used alongside the bundled iptables bins on EL7 distros.

So the official install script should not be used to upgrade k3s in these cases, as it is hardcoded to use the host's tools? Should we just switch out the binary and restart K3s? Both upgrade mechanisms are documented at k3s.io.

We currently prefer the install script because it automatically picks up changes to the systemd service file.
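
For reference, the binary-swap upgrade path is roughly the following: a sketch assuming the default binary location /usr/local/bin/k3s and an amd64 node, with the + in the release tag URL-encoded as %2B:

# Upgrade by replacing the binary directly, without running install.sh.
systemctl stop k3s
curl -fL -o /usr/local/bin/k3s \
  https://github.com/k3s-io/k3s/releases/download/v1.27.7%2Bk3s1/k3s
chmod +x /usr/local/bin/k3s
systemctl start k3s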

on EL7 distros.

You probably mean EL8?

brandond commented 8 months ago

So the official install script should not be used to upgrade k3s in these cases, as it is hardcoded to use the host's tools?

That is a good point; we should probably update the changes from https://github.com/k3s-io/k3s/pull/7274 to use the bundled versions, or perhaps filter out the CNI rules that it will break. I'll have to think on that for a moment. We can't prevent anyone from breaking it themselves by using the host save/restore tools, but we should at least not do it ourselves.

You probably mean EL8?

Yes.

As far as I can tell, using the host iptables-save/iptables-restore commands provided by EL8's iptables package to manage rules created by the k3s bundled iptables binaries has been broken since v1.25.5+k3s1. There were no changes between any of the versions you're using that would have changed the behavior, and indeed I can reproduce it at any time just by doing a save|restore and restarting k3s - no upgrade necessary.

ChristianCiach commented 8 months ago

and indeed I can reproduce it at any time just by doing a save|restore and restarting k3s - no upgrade necessary.

But still, using the provided Vagrant reproducer I can only reproduce this issue by upgrading to v1.27.7+k3s1; upgrades to earlier versions work fine. Edit: Actually, no, those fail too. I couldn't reproduce the issue reliably because I didn't wait long enough between the initial install and the upgrade; once I increased the wait time to 60 seconds, I could reproduce it reliably.

I think there is very little to gain in understanding this little detail, so I am fine with the current state of the investigation :)

brandond commented 8 months ago

using the provided vagrant reproducer I can only reproduce this issue by upgrading to v1.27.7+k3s1.

Are you seeing anything different than I am with regards to the host iptables-save/iptables-restore commands dropping the --mark 0x2000/0x2000 option from the portmap cni rules? This appears to be the root cause of the issue, and is reproducible going all the way back to v1.25.5+k3s1.

The install script was changed to call iptables-save/iptables-restore in April, and I'm not seeing anything newer than that related to iptables.

ChristianCiach commented 8 months ago

Are you seeing anything different than I am with regards to the host iptables-save/iptables-restore commands dropping the --mark 0x2000/0x2000 option from the portmap cni rules?

No, I can confirm all of your findings (even though I don't really understand what these --mark thingies are good for or even how you did the connection traces. I feel out of my league here).

I was also wrong to claim that upgrades to versions before v1.27.6+k3s1 work. Well, they do work - for us. But inside the vanilla AlmaLinux 8 Vagrant VM, upgrades to previous versions of K3s fail in the same way.

So why did upgrades to previous versions of K3s work for us? I am guessing that this has something to do with the custom iptables rules we deploy to our machines, which somehow made upgrades work up to and including v1.27.6+k3s1. We may never really know, but I am actually fine with that.

I see that you already prepared a PR that checks for a faulty iptables-save inside install.sh. It seems a bit hacky, but it's still way better than any solution I would've come up with. Thanks! I will probably migrate away from install.sh for upgrades anyway. Neither the system-upgrade-controller nor https://github.com/k3s-io/k3s-ansible relies on install.sh, so I see very little reason to keep relying on it.

Thank you for taking my very unstructured "discussion" seriously!

ShylajaDevadiga commented 8 months ago

Validated using k3s version v1.28.3-rc3+k3s2 on single-node and multi-node setups.

config.yaml:

write-kubeconfig-mode: 644
cluster-init: true
token: <TOKEN>
prefer-bundled-bin: true

Version installed

$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-172-31-2-207.us-east-2.compute.internal   Ready    control-plane,etcd,master   88s   v1.28.2+k3s1
[rocky@ip-172-31-2-207 ~]$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   coredns-6799fbcd5-55wmr                   1/1     Running     0          77s
kube-system   helm-install-traefik-crd-5fgmk            0/1     Completed   0          77s
kube-system   helm-install-traefik-qgtpq                0/1     Completed   1          77s
kube-system   local-path-provisioner-84db5d44d9-rjph5   1/1     Running     0          77s
kube-system   metrics-server-67c658944b-xtf6z           1/1     Running     0          77s
kube-system   svclb-traefik-bd0ba429-5jwt7              2/2     Running     0          50s
kube-system   traefik-7bf7d7576d-7vltg                  1/1     Running     0          50s

Successful upgrade

$ kubectl get nodes
NAME                                         STATUS   ROLES                       AGE   VERSION
ip-172-31-2-207.us-east-2.compute.internal   Ready    control-plane,etcd,master   5m    v1.28.3-rc3+k3s2
[rocky@ip-172-31-2-207 ~]$ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS      AGE
kube-system   coredns-6799fbcd5-55wmr                   0/1     Running     1 (83s ago)   4m49s
kube-system   helm-install-traefik-crd-5fgmk            0/1     Completed   0             4m49s
kube-system   helm-install-traefik-qgtpq                0/1     Completed   1             4m49s
kube-system   local-path-provisioner-84db5d44d9-rjph5   1/1     Running     1 (83s ago)   4m49s
kube-system   metrics-server-67c658944b-xtf6z           0/1     Running     1 (83s ago)   4m49s
kube-system   svclb-traefik-bd0ba429-5jwt7              2/2     Running     2 (83s ago)   4m22s
kube-system   traefik-7bf7d7576d-7vltg                  1/1     Running     1 (83s ago)   4m22s