kubermatic / kubeone

Kubermatic KubeOne automates cluster operations on all your cloud, on-prem, edge, and IoT environments.
https://kubeone.io
Apache License 2.0

Tolerate held packages during cluster repair #1578

Open jzink-tss opened 3 years ago

jzink-tss commented 3 years ago

What happened: After one of our control planes in the staging cluster failed, I followed the cluster repair guide in order to set up a new node.

But then, on the existing control planes, the following error occurred (error message unescaped and shortened for better readability):

failed to install kubeadm: + export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/sbin:/usr/local/bin:/opt/bin
+ PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/sbin:/usr/local/bin:/opt/bin
[...]
+ sudo add-apt-repository 'deb https://download.docker.com/linux/ubuntu focal stable'
+ sudo apt-get install -y 'containerd.io=1.4.*'
E: Held packages were changed and -y was used without --allow-change-held-packages.
: Process exited with status 100

This seems to happen because there are several packages held back by default:

# apt-mark showhold
containerd.io
kubeadm
kubectl
kubelet
kubernetes-cni

When I "unheld" them by running apt-mark unhold containerd.io kubeadm kubectl kubelet kubernetes-cni, everything worked as expected (and described in the guide).

You should either add this step to the guide or implement toleration of held packages (e.g. use --allow-change-held-packages).
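
For illustration, the failing install line from the log above could tolerate the holds roughly like this (a sketch; the exact line in KubeOne's install script may differ):

```bash
# Same install as in the error log, but allowed to change held packages:
sudo apt-get install -y --allow-change-held-packages 'containerd.io=1.4.*'
```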

Anything else we need to know? containerd.io was 1.4.9 before, which is matched by 'containerd.io=1.4.*'. Maybe that caused the problem in the first place?
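
To double-check whether the installed version is actually covered by such a pin, standard apt tooling is enough (nothing KubeOne-specific here):

```bash
# Show installed and candidate versions plus pin priorities:
apt-cache policy containerd.io

# Or just print the installed version:
dpkg-query -W -f='${Version}\n' containerd.io
```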

Information about the environment: KubeOne version (kubeone version):

{
  "kubeone": {
    "major": "1",
    "minor": "3",
    "gitVersion": "1.3.0",
    "gitCommit": "bfe6683334acdbb1a1d9cbbb2d5d5095f6f0111e",
    "gitTreeState": "",
    "buildDate": "2021-09-15T06:03:30Z",
    "goVersion": "go1.16.7",
    "compiler": "gc",
    "platform": "linux/amd64"
  },
  "machine_controller": {
    "major": "1",
    "minor": "35",
    "gitVersion": "v1.35.2",
    "gitCommit": "",
    "gitTreeState": "",
    "buildDate": "",
    "goVersion": "",
    "compiler": "",
    "platform": "linux/amd64"
  }
}

Operating system: Alpine (Docker image: hashicorp/terraform:1.0.8 + kubeone installed with wget)
Provider you're deploying cluster on: Hetzner
Operating system you're deploying on: Ubuntu 20.04

shaase-ctrl commented 3 years ago

Thank you for reporting this to us.

c4tz commented 7 months ago

Just like past me, I have now stumbled upon this again while fixing a faulty control plane using the manual cluster repair guide.

I got almost the same error (it's just not containerd this time, but the Kubernetes packages themselves), but I cannot figure out why the fix that was applied the first time I found this doesn't work anymore.

I'm currently on the release/v1.7 branch (commit 3edc498), because I had this issue before and needed both fixes to be included in the KubeOne version I used.

I also tried using --force-upgrade, because I had a look at the fix and suspect this condition might be causing the problem now.

OS is Ubuntu 22.04, and the control plane in question is freshly created (because the old one was faulty and the guide tells you to delete and re-create it). The remaining 2 control planes are on the same OS and Kubernetes version, but have of course been provisioned by an older version of KubeOne, if that helps.

Log: log.txt

PS: Unfortunately, I'll be away for a month, so I cannot test proposed solutions in the meantime.

xmudrii commented 7 months ago

Thanks for reporting! I'll reopen the issue so we can verify it on our side
/reopen

kubermatic-bot commented 7 months ago

@xmudrii: Reopened this issue.

In response to [this](https://github.com/kubermatic/kubeone/issues/1578#issuecomment-2079298123):

> Thanks for reporting! I'll reopen the issue so we can verify it on our side
> /reopen

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

c4tz commented 5 months ago

I got the workaround working again (unholding the packages manually, on all 3 control planes this time). But as the underlying problem seems to persist, I'll leave this issue open.
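
For completeness, applying that to all control planes in one pass looks roughly like this (host names are placeholders for the actual control plane addresses):

```bash
# Hypothetical host names; substitute the real control plane nodes.
for host in cp-1 cp-2 cp-3; do
  echo "== ${host} =="
  ssh "${host}" 'sudo apt-mark unhold containerd.io kubeadm kubectl kubelet kubernetes-cni'
done
```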

kubermatic-bot commented 1 week ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

xmudrii commented 1 week ago

/remove-lifecycle stale