rancher / k3os

Purpose-built OS for Kubernetes, fully managed by Kubernetes.
https://k3os.io
Apache License 2.0

K3OS kube-proxy fails after Upgrade from v0.19.5-rc.1 / v0.19.8-k3s1r0 to v0.20.4-k3s1r0 #686

Open containerguy opened 3 years ago

containerguy commented 3 years ago

Version (k3OS / kernel): k3os version v0.20.4-k3s1r0, kernel 5.4.0-70-generic #78 SMP Fri Mar 26 17:09:23 UTC 2021

Architecture: x86_64

Describe the bug

kube-proxy is not reachable due to a broken iptables rule; see /var/log/k3s-service.log.

Log line: F0428 11:24:28.651945 9365 network_policy_controller.go:261] Failed to verify rule exists in KUBE-ROUTER-INPUT chain due to running [/sbin/iptables -t filter -C KUBE-ROUTER-INPUT -p tcp -m comment --comment allow LOCAL TCP traffic to node ports - NTGHJFAJAOEPWU6M -m addrtype --dst-type LOCAL -m multiport --dports -j RETURN --wait]: exit status 2: iptables v1.8.4 (legacy): invalid port/service `' specified. Try `iptables -h' or 'iptables --help' for more information.

The same error occurs if I upgrade from v0.19.5 to v0.20.4.

To Reproduce

Upgrade one K3OS Node to v0.20.4-k3s1r0

Expected behavior: iptables rules that work with the previous k3os version should also work with a newer version, as they are managed by k3s.

Actual behavior: iptables rules are broken, the cluster tries to restart kube-proxy several times (as seen in the Kubernetes events), and the node is unusable until it is downgraded to the previous version.

Additional context: multi-master HA cluster with 3x master and 3x worker nodes, virtualized on VMware. Attached: k3s-service.log

brandond commented 3 years ago

This is an issue with K3s and not K3os; please see k3s-io/k3s#2996. You must make sure the server is the same version or newer than the agent. When upgrading, servers must ALL be upgraded before upgrading agents.

ssmiller25 commented 3 years ago

Thanks for the info @brandond! I ran into the exact same error, and after forcing the master to upgrade, everything came back online!

The root cause of my issue was that I had set my entire cluster to auto-upgrade via the system-upgrade-controller, both masters and agents. Since there is a single plan, and an agent was the first node it tried to upgrade, the whole process stopped when that node failed to come back online. Still thinking of potential solutions, but it may be easier to just mention this in the README for upgrades from 1.19 to 1.20, since it seems related to the network-policy implementation in 1.20 (if I'm following the linked issues correctly).

brandond commented 3 years ago

Prior upgrades may have been more forgiving about this, but it is a standard part of the Kubernetes version skew policy that servers always need to be upgraded before agents. If you're using the upgrade controller, you should definitely use separate plans for servers and agents, and upgrade the servers first - ESPECIALLY when upgrading between minor versions.

https://kubernetes.io/docs/setup/release/version-skew-policy/#supported-component-upgrade-order
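For reference, a minimal sketch of what that split could look like with two system-upgrade-controller Plans, so that every server is upgraded before any agent is touched. The plan names, namespace, service account, and the node-role label used to tell servers from agents are assumptions for illustration; the upgrade image/command/args and any cordon/drain settings should be copied verbatim from the existing k3os-latest plan on your cluster (kubectl get plans.upgrade.cattle.io -A -o yaml), not from this sketch.

```yaml
# Sketch only: split the single upgrade plan into a server plan and an agent
# plan so that all servers are upgraded before any agent. Names, namespace,
# service account and the node-role label are assumptions; copy the upgrade
# image/command/args and cordon/drain settings from the existing k3os-latest
# plan (kubectl get plans.upgrade.cattle.io -A -o yaml).
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3os-servers                 # hypothetical name
  namespace: k3os-system             # assumed: use the namespace of the stock plan
spec:
  concurrency: 1                     # one server at a time
  version: v0.20.4-k3s1r0            # or a release channel URL, as in the stock plan
  serviceAccountName: k3os-upgrade   # assumed: reuse the stock plan's service account
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master   # adjust if your servers use control-plane
        operator: Exists
  upgrade:
    image: rancher/k3os              # assumed: copy image/command/args from the stock plan
---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: k3os-agents                  # hypothetical name
  namespace: k3os-system
spec:
  concurrency: 1
  version: v0.20.4-k3s1r0            # only bump this after all servers are upgraded
  serviceAccountName: k3os-upgrade
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/master
        operator: DoesNotExist       # everything that is not a server
  upgrade:
    image: rancher/k3os              # same upgrade spec as the server plan
```

With two plans, the ordering can be enforced simply by creating (or version-bumping) the agent plan only after kubectl get nodes shows every server running the new release; the version-skew policy linked above remains the authority on which orderings are supported.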

ssmiller25 commented 3 years ago

Thanks for the info! Based on this, I wonder if the default k3os-latest plan for the system-upgrade-controller bundled with K3OS should be adjusted, since the way it's configured now it doesn't ensure that the servers are upgraded first. The k3os.io/upgrade label could be adjusted to manually control the upgrade process, but that kind of defeats the point of an automatic upgrade controller. At a minimum this should probably be mentioned in the Upgrade and Maintenance section of the README.
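As a concrete illustration of that labeling workaround: assuming (an assumption to verify against the plan's actual nodeSelector) that the stock k3os-latest plan skips nodes whose k3os.io/upgrade label is set to disabled, agents could be parked out of the plan until the servers are done, roughly like this:

```yaml
# Hypothetical sketch: an agent node held out of the shared k3os-latest plan.
# The label key comes from the comment above; the value "disabled" and the
# claim that the plan's nodeSelector honours it are assumptions to verify
# (kubectl get plans.upgrade.cattle.io -A -o yaml). In practice you would
# toggle the label with kubectl rather than applying a Node manifest:
#   kubectl label node agent-1 k3os.io/upgrade=disabled   # before bumping the plan
#   kubectl label node agent-1 k3os.io/upgrade-           # once all servers are upgraded
apiVersion: v1
kind: Node
metadata:
  name: agent-1                  # placeholder node name
  labels:
    k3os.io/upgrade: disabled    # excluded from the plan while this label is set
```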

brandond commented 3 years ago

@dweomer the ask above regarding the default plan seems reasonable, is this something that could be accommodated without too much work?

containerguy commented 3 years ago

You are totally right: after manually upgrading the servers first and the workers afterwards, it is working fine. I suggest that, at a minimum, the documentation could be updated to more clearly recommend dedicated upgrade plans for servers and agents.

dweomer commented 3 years ago

> @dweomer the ask above regarding the default plan seems reasonable, is this something that could be accommodated without too much work?

I never got around to it, but the k3os-latest plan needs to be deprecated in favor of two plans: one for servers and one for all other agents. The thing about the k3os-latest plan, though, is that it was meant to be an example, yet it has become the de facto upgrade descriptor.