@kashifest, unfortunately, CAPD is not the best solution to test this behavior due to the custom cloud-init interpreter it uses (see https://github.com/kubernetes-sigs/cluster-api/issues/3488)
I'm going to try to reproduce with another infrastructure provider
The behavior in other providers is slightly different. In fact, cloud-init logs report a failure in executing a command, but the execution then moves on to the next command (kubeadm join in this case):
/var/lib/cloud/instance/scripts/runcmd: 7: /var/lib/cloud/instance/scripts/runcmd: ap: not found
+++ [2020-08-17T13:10:53+00:00] running 'kubeadm join phase preflight --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests'
....
That means that, in case of those types of errors, KCP simply goes on with the rollout because the machines come up and the node successfully joins the cluster (in fact the error is swallowed by cloud-init 😐).
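To illustrate why the error is swallowed, here is a rough, heavily abbreviated sketch of the relevant part of the generated cloud-init user data. This is only an assumption about the real output of the kubeadm bootstrap provider; the two commands are taken from the logs above:

```yaml
# Abbreviated sketch of the user data for a joining control plane machine.
# The real output is generated by the kubeadm bootstrap provider and also
# contains write_files entries for the kubeadm configuration.
runcmd:
  - 'ap update'   # typo in preKubeadmCommands; fails with "ap: not found"
  - 'kubeadm join phase preflight --ignore-preflight-errors=DirAvailable--etc-kubernetes-manifests'
# cloud-init concatenates the runcmd list into a single script
# (/var/lib/cloud/instance/scripts/runcmd) and keeps executing it after the
# failed command, so kubeadm join still runs and the node joins the cluster.
```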
Other types of provisioning errors could lead to different behaviors.
FYI https://github.com/kubernetes-sigs/cluster-api/issues/3138 is expected to provide more visibility on what is happening on the node, and hopefully make it simpler for the user to determine the next action.
@kashifest Please let me know if this is enough to answer this question.
/priority awaiting-more-evidence
@fabriziopandini: The label(s) priority/awaiting-more-evidence cannot be applied, because the repository doesn't have them.
@fabriziopandini thanks for the update. I will check and report back if I can get rid of the errors with another infra provider.
/milestone Next
@fabriziopandini I understand your point that the example error I showed in the original issue description is swallowed by cloud-init, but the issue of not being able to roll back in case of an upgrade failure remains. What I meant is that the example given here with CAPD and cloud-init might not be the right example, so let's consider another one:
We initialized KCP with 3 replicas at k8s v1.18.0, then edited KCP to upgrade to k8s v1.18.2 and everything worked smoothly. Now we want KCP to upgrade to k8s v1.18.4, but when editing the KCP we accidentally set the k8s version to v1.18.3. It gives the following error:
[manager] Unable to find image 'kindest/node:v1.18.3' locally
[manager] docker: Error response from daemon: manifest for kindest/node:v1.18.3 not found: manifest unknown: manifest unknown.
[manager] See 'docker run --help'.
[manager] I0826 12:09:46.314683 8 dockermachine_controller.go:200] controllers/DockerMachine/DockerMachine-controller "msg"="failed to create worker DockerMachine: command \"docker run --detach --tty --privileged --security-opt seccomp=unconfined --tmpfs /tmp --tmpfs /run --volume /var --volume /lib/modules:/lib/modules:ro --hostname my-cluster-my-controlplane-lgp6m --name my-cluster-my-controlplane-lgp6m --label io.x-k8s.kind.cluster=my-cluster --label io.x-k8s.kind.role=control-plane --expose 38845 --publish=127.0.0.1:38845:6443/TCP kindest/node:v1.18.3\" failed with error: exit status 125, cleaning up so we can re-provision from a clean state" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"my-controlplane-vvvtg"} "machine"="my-controlplane-lgp6m"
[manager] E0826 12:09:46.350051 8 dockermachine_controller.go:129] controllers/DockerMachine/DockerMachine-controller "msg"="failed to patch DockerMachine" "error"="error patching conditions: The condition \"Ready\" was modified by a different process and this caused a merge/AddCondition conflict: \u0026v1alpha3.Condition{\n \tType: \"Ready\",\n \tStatus: \"False\",\n- \tSeverity: \"Info\",\n+ \tSeverity: \"Warning\",\n- \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:41 +0000 UTC\"},\n+ \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:46 +0000 UTC\"},\n- \tReason: \"WaitingForBootstrapData\",\n+ \tReason: \"ContainerProvisioningFailed\",\n \tMessage: \"0 of 2 completed\",\n }\n" "cluster"="my-cluster" "docker-cluster"="my-cluster" "docker-machine"={"Namespace":"default","Name":"my-controlplane-vvvtg"} "machine"="my-controlplane-lgp6m"
[manager] E0826 12:09:46.350207 8 controller.go:248] controller-runtime/controller "msg"="Reconciler error" "error"="error patching conditions: The condition \"Ready\" was modified by a different process and this caused a merge/AddCondition conflict: \u0026v1alpha3.Condition{\n \tType: \"Ready\",\n \tStatus: \"False\",\n- \tSeverity: \"Info\",\n+ \tSeverity: \"Warning\",\n- \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:41 +0000 UTC\"},\n+ \tLastTransitionTime: v1.Time{Time: s\"2020-08-26 12:09:46 +0000 UTC\"},\n- \tReason: \"WaitingForBootstrapData\",\n+ \tReason: \"ContainerProvisioningFailed\",\n \tMessage: \"0 of 2 completed\",\n }\n" "controller"="dockermachine" "name"="my-controlplane-vvvtg" "namespace"="default"
Now we realize the mistake, edit the KCP again, and fix the k8s version to v1.18.4, but we see that it is still trying to roll out KCP with v1.18.3. No rollback of the previous change is happening and the new change is not respected.
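For reference, the change at each step of this scenario is just the spec.version field of the KubeadmControlPlane object. A minimal sketch (object name and omitted fields are illustrative, API version inferred from the v1alpha3 types in the logs):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-controlplane     # name as it appears in the machine names in the logs
spec:
  replicas: 3
  version: v1.18.3          # typo: v1.18.4 was intended; no kindest/node:v1.18.3
                            # image exists, so provisioning fails as shown above
  # infrastructureTemplate and kubeadmConfigSpec omitted
```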
@kashifest I got your point.
I think that what you are calling rollback is similar to the problem that we are trying to fix with the machine health check: how to identify and get rid of broken control plane nodes in a controlled way (e.g. without losing quorum).
In fact, I assume that after #3185 lands, the problem of getting rid of the v1.18.3 machines (and thus unblocking the installation of the v1.18.4 ones) could already be covered by the machine health check if a NodeStartupTimeout is properly configured:
// Machines older than this duration without a node will be considered to have
// failed and will be remediated.
// +optional
NodeStartupTimeout *metav1.Duration `json:"nodeStartupTimeout,omitempty"`
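As a sketch of that assumption, a MachineHealthCheck targeting the control plane machines could look roughly like this (name, selector, and timeout values are illustrative, not a recommendation):

```yaml
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineHealthCheck
metadata:
  name: my-cluster-control-plane-mhc     # illustrative name
spec:
  clusterName: my-cluster
  # Machines that never get a Node within this window are considered failed
  # and remediated, which is what should unblock the stuck rollout.
  nodeStartupTimeout: 10m
  selector:
    matchLabels:
      cluster.x-k8s.io/control-plane: "" # label set on control plane Machines
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 5m
    - type: Ready
      status: "False"
      timeout: 5m
```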
Maybe someone else could chime in and confirm my assumption here.
/milestone Next
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
Given that KCP remediation is merged, I'm +1 for closing this issue. Opinions?
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/close
@vincepri: Closing this issue.
In response to this:
What steps did you take and what happened:
While upgrading KCP, if there is an upgrade failure due to a bug in the cloud-init data, the machine is requeued. This is expected. However, when the KCP is edited again and the cloud-init error is fixed, KCP still tries to run the previous version of the bootstrap data, ignoring the corrected KCP spec.
What did you expect to happen: KCP would roll back the problematic upgrade, take into account the new change applied to the KCP, and roll out new control plane machines with the correct cloud-init. There should be a way to re-apply an upgrade in case the upgrade fails.
Anything else you would like to add:
A typical example would be to add an erroneous command in preKubeadmCommands like ap update. This is the error. Now, since there was a typo in the preKubeadmCommands entry, once the KCP is edited again and the command is corrected to apt update, it can be seen that this is overlooked: the docker controller keeps trying to provision a new control plane machine, but still with the old bootstrap data that contains the typo (see the sketch below).
/kind bug
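A minimal sketch of the field involved (values illustrative, other fields omitted):

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
kind: KubeadmControlPlane
metadata:
  name: my-controlplane
spec:
  replicas: 3
  version: v1.18.2                 # illustrative
  kubeadmConfigSpec:
    preKubeadmCommands:
      - ap update                  # typo for "apt update"; even after the manifest is
                                   # corrected, the machine is re-provisioned with the
                                   # previously generated bootstrap data containing the typo
  # infrastructureTemplate omitted
```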