kubernetes-sigs / cluster-api

Home for Cluster API, a subproject of sig-cluster-lifecycle
https://cluster-api.sigs.k8s.io
Apache License 2.0

kubeadm controlplane doesn't roll back in case of upgrade error #3483

Closed: kashifest closed this issue 3 years ago

kashifest commented 4 years ago

What steps did you take and what happened:

While upgrading KCP, if there is an upgrade failure due to a bug in cloud-init, the machine is requeued. This is expected. However, when the KCP is edited again and the cloud-init error is fixed, KCP still tries to run the previous version of the bootstrap config, ignoring the corrected KCP spec.

What did you expect to happen: KCP would roll back the problematic upgrade, take the new change applied to KCP into account, and roll out new control plane machines with the corrected cloud-init. There should be a way to re-apply an upgrade in case the upgrade fails.

Anything else you would like to add:

A typical example would be to add an erroneous command to preKubeadmCommands, like `ap update`. This is the error:

controllers/DockerMachine/DockerMachine-controller "msg"="Failed running command" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"controlplane-bjw68"} "machine"="controlplane-nsdm6" "command"={"Cmd":"/bin/sh","Args":["-c","ap update"],"Stdin":""} "stderr"="/bin/sh: 1: ap: not found\n"
[manager] I0813 15:19:46.343669       9 dockermachine_controller.go:200] controllers/DockerMachine/DockerMachine-controller "msg"="failed to exec DockerMachine bootstrap: failed to run cloud conifg: command \"docker exec --privileged my-cluster-controlplane-nsdm6 /bin/sh -c 'ap update'\" failed with error: exit status 127, cleaning up so we can re-provision from a clean state" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"controlplane-bjw68"} "machine"="controlplane-nsdm6" 
[manager] I0813 15:19:46.343698       9 machine.go:312] controllers/DockerMachine/DockerMachine-controller "msg"="Deleting machine container" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"controlplane-bjw68"} "machine"="controlplane-nsdm6" 
[manager] I0813 15:19:47.096628       9 generic_predicates.go:162] 

Now, since there was a typo in the preKubeadmCommands entry, once KCP is edited again and the command is corrected to `apt update`, it can be seen that this is overlooked: the Docker controller keeps trying to provision a new control plane machine, but still with the old bootstrap data that contains the typo.
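For reference, the typo would sit in the KCP manifest roughly like this (a minimal sketch assuming the v1alpha3 API; resource names and the DockerMachineTemplate reference are illustrative):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: controlplane            # illustrative name
  namespace: default
spec:
  replicas: 3
  version: v1.18.0
  infrastructureTemplate:
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: DockerMachineTemplate
    name: controlplane          # illustrative name
  kubeadmConfigSpec:
    preKubeadmCommands:
      - ap update               # typo; correcting this to "apt update" later is what does not get picked up
```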

/kind bug

fabriziopandini commented 4 years ago

@kashifest, unfortunately, CAPD is not the best solution to test this behavior due to the custom cloud-init interpreter it uses (see https://github.com/kubernetes-sigs/cluster-api/issues/3488)

I'm going to try to reproduce with another infrastructure provider

fabriziopandini commented 4 years ago

The behavior in other providers is slightly different. In fact, cloud-init logs report a failure in executing a command, but then the execution moves on to the next command (kubeadm join in this case):

/var/lib/cloud/instance/scripts/runcmd: 7: /var/lib/cloud/instance/scripts/runcmd: ap: not found
+++ [2020-08-17T13:10:53+00:00] running 'kubeadm join phase preflight --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests'
....

That means that, in case of those types of errors, KCP simply goes on with the rollout because the machines come up and the node successfully joins the cluster (in fact the error is swallowed by cloud-init 😐).

Other types of provisioning errors could lead to different behaviors:

- In case of errors in creating the machines, CAPI/KCP will continue to retry the operation.
- In case of errors preventing the node from successfully joining the cluster, CAPI/KCP will continue to wait indefinitely, so the user is requested to fix the problem or to delete the machine; after https://github.com/kubernetes-sigs/cluster-api/pull/3185 lands, machine health check could be used to automatically remediate (delete) failing machines, which triggers the creation of a new machine with the latest configuration.

FYI https://github.com/kubernetes-sigs/cluster-api/issues/3138 is expected to provide more visibility on what is happening on the node, and hopefully make it simpler for the user to determine the next action.

@kashifest Please let me know if this is enough to answer this question.
/priority awaiting-more-evidence

k8s-ci-robot commented 4 years ago

@fabriziopandini: The label(s) priority/, priority/ cannot be applied, because the repository doesn't have them

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/3483#issuecomment-674893469):

> /priority awaiting-more-evidence

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

kashifest commented 4 years ago

@fabriziopandini thanks for the update, I will check and update if I can get rid of the errors with other infra

vincepri commented 4 years ago

/milestone Next

kashifest commented 4 years ago

@fabriziopandini I understand your point that the example error I showed in the original issue description is swallowed by cloud-init, but the issue of not being able to roll back in case of upgrade failure remains. What I meant is that the CAPD/cloud-init example given here might not be the right one, so let's consider another example:

We initialized KCP with 3 replicas at k8s v1.18.0. We edited KCP to upgrade to k8s v1.18.2 and everything worked smoothly. Now we want to upgrade KCP to k8s v1.18.4, but we edit the KCP and accidentally set the k8s version to v1.18.3. It gives the following error:

[manager] Unable to find image 'kindest/node:v1.18.3' locally
[manager] docker: Error response from daemon: manifest for kindest/node:v1.18.3 not found: manifest unknown: manifest unknown.
[manager] See 'docker run --help'.
[manager] I0826 12:09:46.314683       8 dockermachine_controller.go:200] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: command \"docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --hostname my-cluster-my-controlplane-lgp6m --name my-cluster-my-controlplane-lgp6m --label io.x-k8s.kind.cluster=my-cluster --label io.x-k8s.kind.role=control-plane --expose 38845 --publish=127.0.0.1:38845:6443/TCP kindest/node:v1.18.3\" failed with error: exit status 125, cleaning up so we can re-provision from a clean state" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"my-controlplane-vvvtg"} "machine"="my-controlplane-lgp6m" 
[manager] E0826 12:09:46.350051       8 dockermachine_controller.go:129] controllers/DockerMachine/DockerMachine-controller "msg"="failed to patch DockerMachine" "error"="error patching conditions: The condition \"Ready\" was modified by a different process and this caused a merge/AddCondition conflict:   \u0026v1alpha3.Condition{\n  \tType:               \"Ready\",\n  \tStatus:             \"False\",\n- \tSeverity:           \"Info\",\n+ \tSeverity:           \"Warning\",\n- \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:41 +0000 UTC\"},\n+ \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:46 +0000 UTC\"},\n- \tReason:             \"WaitingForBootstrapData\",\n+ \tReason:             \"ContainerProvisioningFailed\",\n  \tMessage:            \"0 of 2 completed\",\n  }\n" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"my-controlplane-vvvtg"} "machine"="my-controlplane-lgp6m" 
[manager] E0826 12:09:46.350207       8 controller.go:248] controller-runtime/controller "msg"="Reconciler error" "error"="error patching conditions: The condition \"Ready\" was modified by a different process and this caused a merge/AddCondition conflict:   \u0026v1alpha3.Condition{\n  \tType:               \"Ready\",\n  \tStatus:             \"False\",\n- \tSeverity:           \"Info\",\n+ \tSeverity:           \"Warning\",\n- \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:41 +0000 UTC\"},\n+ \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:46 +0000 UTC\"},\n- \tReason:             \"WaitingForBootstrapData\",\n+ \tReason:             \"ContainerProvisioningFailed\",\n  \tMessage:            \"0 of 2 completed\",\n  }\n" "controller"="dockermachine" "name"="my-controlplane-vvvtg" "namespace"="default"

Now we realize the mistake, edit KCP again, and fix the k8s version to v1.18.4, but we see that it is still trying to roll out KCP with v1.18.3. No rollback of the previous change happens and the new change is not respected.
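In manifest terms, the sequence of edits looks roughly like this (an illustrative KubeadmControlPlane excerpt; the field being edited is `spec.version`):

```yaml
# KubeadmControlPlane excerpt (illustrative)
spec:
  replicas: 3
  version: v1.18.3   # mistake: v1.18.4 was intended, and kindest/node:v1.18.3 does not exist
  # Editing this field back to v1.18.4 after the failed rollout is not picked up:
  # KCP keeps trying to bring up a v1.18.3 machine instead of rolling out the corrected spec.
```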

fabriziopandini commented 4 years ago

@kashifest I got your point.

I think that what you are calling rollback is similar to the problem that we are trying to fix with the machine health check: how to identify and get rid of broken control plane nodes in a controlled way (e.g. without losing quorum).

In fact, I assume that after #3185 lands, the problem of getting rid of the v1.18.3 machine (and thus unblocking the installation of the v1.18.4 ones) could already be covered by the machine health check if a NodeStartupTimeout is properly configured:

// Machines older than this duration without a node will be considered to have
// failed and will be remediated.
// +optional
NodeStartupTimeout *metav1.Duration `json:"nodeStartupTimeout,omitempty"`
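For illustration, a MachineHealthCheck targeting control plane machines might look like this (a sketch assuming the v1alpha3 API; the selector label, timeout values, and names are illustrative):

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: my-cluster-kcp-unhealthy     # illustrative name
  namespace: default
spec:
  clusterName: my-cluster
  nodeStartupTimeout: 10m            # machines without a node after 10m are considered failed and remediated
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: ""   # assumed label selecting control plane machines
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```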

Maybe someone else could chime in and confirm my assumption here.

vincepri commented 4 years ago

/milestone Next

fejta-bot commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fabriziopandini commented 3 years ago

Given that KCP remediation is merged, I'm +1 for closing this issue. Opinions?

fejta-bot commented 3 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

vincepri commented 3 years ago

/close

k8s-ci-robot commented 3 years ago

@vincepri: Closing this issue.

In response to [this](https://github.com/kubernetes-sigs/cluster-api/issues/3483#issuecomment-753790596):

> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.