metal3-io / baremetal-operator

Bare metal host provisioning integration for Kubernetes
Apache License 2.0

When Restoring From Annotation Fails BMO Will Instead Deprovision and Start Intake #670

Closed: Lukasz-SendGrid closed this issue 4 years ago

Lukasz-SendGrid commented 4 years ago

Some context: I was restoring a one-node cluster that manages a 9-node main management cluster, which in turn manages many other clusters. To perform the backup I paused the Cluster object, made sure the status and paused annotations were present on all the BMHs, and took a Velero backup.
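
(The annotation step above can be scripted. The sketch below is one possible way to do it with controller-runtime before running the Velero backup; it is not an official procedure. The annotation keys baremetalhost.metal3.io/status and baremetalhost.metal3.io/paused and the metal3 namespace are assumptions drawn from this thread, so adjust them to match the BMO release in use.)

```go
// Sketch: snapshot each BareMetalHost's status into the status annotation and
// mark the host paused before taking a Velero backup.
package main

import (
	"context"
	"encoding/json"
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const (
	// Assumed annotation keys; double-check against the deployed BMO version.
	statusAnnotation = "baremetalhost.metal3.io/status"
	pausedAnnotation = "baremetalhost.metal3.io/paused"
)

func main() {
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
	if err != nil {
		log.Fatal(err)
	}

	hosts := &unstructured.UnstructuredList{}
	hosts.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "metal3.io", Version: "v1alpha1", Kind: "BareMetalHostList",
	})
	if err := c.List(context.TODO(), hosts, client.InNamespace("metal3")); err != nil {
		log.Fatal(err)
	}

	for i := range hosts.Items {
		bmh := &hosts.Items[i]

		// Copy .status into the annotation so it can be restored after the
		// Velero restore brings the object back without its status.
		status, found, err := unstructured.NestedMap(bmh.Object, "status")
		if err != nil {
			log.Fatal(err)
		}
		if !found {
			log.Printf("skipping %s: no status recorded yet", bmh.GetName())
			continue
		}
		raw, err := json.Marshal(status)
		if err != nil {
			log.Fatal(err)
		}

		ann := bmh.GetAnnotations()
		if ann == nil {
			ann = map[string]string{}
		}
		ann[statusAnnotation] = string(raw)
		ann[pausedAnnotation] = "" // presence of the key is what pauses reconciliation
		bmh.SetAnnotations(ann)

		if err := c.Update(context.TODO(), bmh); err != nil {
			log.Fatal(err)
		}
		log.Printf("annotated %s/%s", bmh.GetNamespace(), bmh.GetName())
	}
}
```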

During the restore I brought up all the CAPI/CAPM3/BMO components, scaled down BMO, and ran the Velero restore. At that point the cluster was paused and the BMHs had the paused and status annotations. I then scaled BMO back up and unpaused the cluster. 5 of the 9 hosts were fine, but for 4 of them the error below was returned and they started to reintake:

{"level":"error","ts":1602102667.1492743,"logger":"controller-runtime.controller","msg":"Reconciler error","controller":"metal3-baremetalhost-controller","request":"metal3/metalkubed-mgmt-prometheus0001p1mdw1.sendgrid.net","error":"Could not restore status from annotation: Operation cannot be fulfilled on baremetalhosts.metal3.io \"metalkubed-mgmt-prometheus0001p1mdw1.sendgrid.net\": the object has been modified; please apply your changes to the latest version and try again","errorVerbose":"Operation cannot be fulfilled on baremetalhosts.metal3.io \"metalkubed-mgmt-prometheus0001p1mdw1.sendgrid.net\": the object has been modified; please apply your changes to the latest version and try again\nCould not restore status from annotation\ngithub.com/metal3-io/baremetal-operator/pkg/controller/baremetalhost.(*ReconcileBareMetalHost).Reconcile\n\t/go/src/github.com/metal3-io/baremetal-operator/pkg/controller/baremetalhost/baremetalhost_controller.go:219\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1373","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.1/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:258\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.5.2/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.17.4/pkg/util/wait/wait.go:88"}

Fortunately none of the 4 were control-plane nodes, so I was able to reintake them and add them back to the cluster easily. However, if my control plane had been among the 4 nodes that couldn't be restored from the annotation and they had started to get deprovisioned, that would have been rough.

Would it be possible to add a feature so that, when restoring from the annotation fails, BMO throws an error and keeps the BMH it failed on paused until manual intervention fixes the status annotation?

Thanks!

maelk commented 4 years ago

This behaviour is quite surprising. If BMO fails to restore the status, it should not remove the annotation and should simply retry: https://github.com/metal3-io/baremetal-operator/blob/master/controllers/metal3.io/baremetalhost_controller.go#L140 . Could you please let us know what version of BMO you are using (which tag, or any other info that would help us) and attach the logs for the failed nodes (especially around the annotation-restore failure and what happens after that)? Thank you.
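
(To make the retry behaviour described here concrete, the sketch below shows the restore-from-annotation pattern: read the saved status from the annotation, write it back to the status subresource, and surface any error so the request is requeued with the annotation still in place. It is written against an unstructured BareMetalHost to stay self-contained and is not the actual controller code linked above.)

```go
package restore

import (
	"context"
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const statusAnnotation = "baremetalhost.metal3.io/status" // assumed annotation key

// restoreStatus copies the saved status out of the annotation and back into
// the BareMetalHost's status subresource. On any error (including an update
// conflict) the annotation is left untouched and the error is returned, so a
// controller calling this simply requeues and retries with a fresh copy of
// the object.
func restoreStatus(ctx context.Context, c client.Client, key client.ObjectKey) error {
	host := &unstructured.Unstructured{}
	host.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "metal3.io", Version: "v1alpha1", Kind: "BareMetalHost",
	})
	if err := c.Get(ctx, key, host); err != nil {
		return err
	}

	saved, ok := host.GetAnnotations()[statusAnnotation]
	if !ok {
		return nil // nothing to restore
	}

	var status map[string]interface{}
	if err := json.Unmarshal([]byte(saved), &status); err != nil {
		return fmt.Errorf("invalid status annotation: %w", err)
	}
	if err := unstructured.SetNestedMap(host.Object, status, "status"); err != nil {
		return err
	}

	// "the object has been modified" conflicts surface here; returning them
	// (instead of deleting the annotation) is what makes the retry safe.
	if err := c.Status().Update(ctx, host); err != nil {
		return fmt.Errorf("could not restore status from annotation: %w", err)
	}
	return nil
}
```

Only after a successful restore does it make sense to remove the annotation, so a failure at either step simply leaves things to be retried on the next reconcile.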

Lukasz-SendGrid commented 4 years ago

Hi @maelk, thanks for looking into this! I am running BMO/CAPM3 version 0.4.0 and CAPI version v0.3.8. I've split the logs out around each specific BMH error type and attached them below. I noticed that, of the 4 nodes, two first threw the error "Could not delete status annotation" and the other two "Could not restore status from annotation". The logs for "Could not restore status from annotation" differed slightly, so I'm attaching both of those.

deleteErr.log restoreErr.log restoreErr0.log

maelk commented 4 years ago

In the logs you gave, BMO was able to get past the errors; it just retried when it hit the conflict. I fail to understand the issue here. It looks like it was able to restore the status on the second try. What were you expecting to see?

Lukasz-SendGrid commented 4 years ago

I see, yes, that is true, but each of the 4 then hit this part (the same in all the logs) where they started deprovisioning, and I guess that's the question at hand. It looks like BMO was able to read in that the BMH was in the provisioned state, but judging by the timestamps it started to deprovision the box right after that. Why would that occur?

{"level":"info","ts":1602102703.4905202,"logger":"baremetalhost","msg":"saving host status","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net","provisioningState":"registering","operational status":"OK","provisioning state":"provisioned"}
{"level":"info","ts":1602102703.500233,"logger":"baremetalhost","msg":"done","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net","provisioningState":"registering","requeue":true,"after":0}
{"level":"info","ts":1602102703.5003352,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net"}
{"level":"info","ts":1602102703.5004506,"logger":"baremetalhost","msg":"changing provisioning state","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net","provisioningState":"provisioned","old":"provisioned","new":"deprovisioning"}
{"level":"info","ts":1602102703.5004747,"logger":"baremetalhost","msg":"saving host status","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net","provisioningState":"provisioned","operational status":"OK","provisioning state":"deprovisioning"}
{"level":"info","ts":1602102703.5057342,"logger":"baremetalhost","msg":"done","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net","provisioningState":"provisioned","requeue":true,"after":0}
{"level":"info","ts":1602102703.5065362,"logger":"baremetalhost","msg":"Reconciling BareMetalHost","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net"}
{"level":"info","ts":1602102703.506639,"logger":"baremetalhost","msg":"deprovisioning","Request.Namespace":"metal3","Request.Name":"metal3-mgmt-node0002p1mdw1.lukasz.net","provisioningState":"deprovisioning"}
{"level":"info","ts":1602102703.5066545,"logger":"baremetalhost_ironic","msg":"deprovisioning","host":"metal3-mgmt-node0002p1mdw1.lukasz.net"}

Lukasz-SendGrid commented 4 years ago

Well, after going through the CAPI logs, it looks like the problem lies with new MachineSets getting created even though the existing ones successfully adopted the machines associated with them.

capilog.log

Lukasz-SendGrid commented 4 years ago

That led me to the CAPM3 logs. Here it looks like, while CAPM3 was waiting with "0 hosts available while choosing host for Metal3 machine", BMO was (comparing the timestamps) still trying to reconcile the old status at the same time, and before it could correctly reconcile the old status, the M3M got marked for deletion. Am I on the right track here? capm3.log

maelk commented 4 years ago

I would say the issue is probably in CAPI only. You can see in the logs:

I1007 20:31:03.001989       1 machineset_controller.go:313] controllers/MachineSet "msg"="Too few replicas" "machineset"="mgmt-prod-a.mdw1-prometheus-6857b5dbb9" "namespace"="metal3" "creating"=2 "need"=2
I1007 20:31:03.002020       1 machineset_controller.go:321] controllers/MachineSet "msg"="Creating machine 1 of 2, ( spec.replicas(2) \u003e currentMachineCount(0) )" "machineset"="mgmt-prod-a.mdw1-prometheus-6857b5dbb9" "namespace"="metal3" 
I1007 20:31:03.013917       1 machineset_controller.go:313] controllers/MachineSet "msg"="Too few replicas" "machineset"="mgmt-prod-a.mdw1-node-5b5bc67cc7" "namespace"="metal3" "creating"=2 "need"=2
I1007 20:31:03.013949       1 machineset_controller.go:321] controllers/MachineSet "msg"="Creating machine 1 of 2, ( spec.replicas(2) \u003e currentMachineCount(0) )" "machineset"="mgmt-prod-a.mdw1-node-5b5bc67cc7" "namespace"="metal3" 
I1007 20:31:03.202005       1 machineset_controller.go:224] controllers/MachineSet "msg"="Adopted Machine" "machineset"="mgmt-prod-a.mdw1-prometheus-5dbc44ccb4" "namespace"="metal3" "machine"="mgmt-prod-a.mdw1-prometheus-5dbc44ccb4-fd96x"
I1007 20:31:03.303188       1 machineset_controller.go:224] controllers/MachineSet "msg"="Adopted Machine" "machineset"="mgmt-prod-a.mdw1-node-5945dc5864" "namespace"="metal3" "machine"="mgmt-prod-a.mdw1-node-5945dc5864-lsmv9"
I1007 20:31:03.312740       1 machineset_controller.go:385] controllers/MachineSet "msg"="Too many replicas" "machineset"="mgmt-prod-a.mdw1-node-5945dc5864" "namespace"="metal3" "deleting"=2 "need"=0
I1007 20:31:03.312767       1 machineset_controller.go:391] controllers/MachineSet "msg"="Found delete policy" "machineset"="mgmt-prod-a.mdw1-node-5945dc5864" "namespace"="metal3" "delete-policy"="Random"

So basically it creates the replicas before adopting the old machines, then realizes it has too many replicas and deletes the old ones. Was the cluster properly paused when restoring? If so, there might be an issue in CAPI itself; this should not happen in this order.
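
(For anyone checking the same thing after a restore, a quick pre-flight verification along these lines can confirm the Cluster and the BMHs are still paused before scaling the controllers back up. This is a sketch: the cluster name and namespace are placeholders, cluster.x-k8s.io/v1alpha3 matches the CAPI v0.3.x mentioned earlier, and baremetalhost.metal3.io/paused is the assumed paused-annotation key.)

```go
package main

import (
	"context"
	"fmt"
	"log"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func main() {
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{})
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.TODO()

	// 1. The CAPI Cluster should still carry spec.paused=true from the backup.
	cluster := &unstructured.Unstructured{}
	cluster.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "cluster.x-k8s.io", Version: "v1alpha3", Kind: "Cluster",
	})
	key := types.NamespacedName{Namespace: "metal3", Name: "my-cluster"} // placeholder name
	if err := c.Get(ctx, key, cluster); err != nil {
		log.Fatal(err)
	}
	paused, _, _ := unstructured.NestedBool(cluster.Object, "spec", "paused")
	fmt.Printf("cluster paused: %v\n", paused)

	// 2. Every BareMetalHost should still carry the paused annotation.
	hosts := &unstructured.UnstructuredList{}
	hosts.SetGroupVersionKind(schema.GroupVersionKind{
		Group: "metal3.io", Version: "v1alpha1", Kind: "BareMetalHostList",
	})
	if err := c.List(ctx, hosts, client.InNamespace("metal3")); err != nil {
		log.Fatal(err)
	}
	for _, h := range hosts.Items {
		_, ok := h.GetAnnotations()["baremetalhost.metal3.io/paused"]
		fmt.Printf("bmh %s paused annotation present: %v\n", h.GetName(), ok)
	}
}
```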

Lukasz-SendGrid commented 4 years ago

Yes, I have dumps of all the YAML used in the restore and can confirm the cluster was paused. And good catch: this looks to have happened before BMO even started reconciling.

I appreciate your help in looking into this, @maelk! I'll close this out, move it over to the CAPI board, and see what they think: https://github.com/kubernetes-sigs/cluster-api/issues/3770