Open mreiger opened 3 years ago
Here are related metal-api error logs:
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","status":403,"content-length":55,"duration":0.000260831}
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:100","message":"cannot get user from request","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","error":"Wrong HMAC found","got":"667c82a5692d7e1013a88b022af0f367e3d349860818ff8714b685bed3ae91e3","want":"4a2e464619017e291fd0113c5a884b89d825a5e8ea8fbd6b1a83eb5f1eab86c8"}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"metal-edit@metal-stack.io","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","status":422,"content-length":137,"duration":0.087437639}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"service/service.go:96","message":"service error","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"metal-edit@metal-stack.io","operation":"finalizeAllocation","status":422,"error":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731","service-caller":"machine-service.go:1635","resp":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
And these are from metal-core:
2021-08-04T12:48:48.301245+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011endpoint/abortReinstall.go:38#011Failed to abort reinstall#011{"app": "metal-core", "statusCode": 500, "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "boot information": null}
2021-08-04T12:48:48.299985+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011api/abortReinstall.go:22#011Failed to abort reinstall#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "error": "[POST /v1/machine/{id}/abort-reinstall][403] abortReinstallMachine default Wrong HMAC found (403)"}
2021-08-04T12:48:48.279567+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.276Z#011error#011endpoint/report.go:57#011Unable to report machine back to api.#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
2021-08-04T12:48:48.278363+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.275Z#011error#011api/finalizeAllocation.go:27#011Finalize failed#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
I tried to reconstruct what happened, here is what found:
/report
, returning an error
/finalize-allocation
reinstall
field in the machine to false
/abort-reinstall
, returning an error as well
kexec.Reboot
, restarting the machinereinstall
flag on false
, thinking it's a usual released machine...This certainly is a cluster of very unhappy events.
My question would be: Why does the metal-hammer wipe disks when there is an allocation on the machine? I think, this should be the first thing to prevent before starting to wipe disks. Maybe this condition is to specific: https://github.com/metal-stack/metal-hammer/blob/v0.9.1/cmd/root.go#L129.
I'll investigate a little more to see what we can do.
Apparently it can sometimes happen that an attempt from metal-hammer to report a successful reinstall of a machine fails with an error code 500; in one such instance the machine rebooted and when the metal-hammer next contacted the api was told to do a fresh install, which wiped all local disks.
This is obviously undesired behaviour; how can this issue be fixed?
Error message from metal-hammer ipmi console log when failing to report:
Complete log files of failed reinstall followed by fresh install, and for comparison a successful reinstall, are attached.
successful-reinstall.txt failed-reinstall-and-subsequent-install-with-wipe.txt