metal-stack / metal-api

API to manage and control plane resources like machines, switches, operating system images, machine sizes, networks, IP addresses and more
GNU Affero General Public License v3.0
128 stars 9 forks source link

Failure to report reinstall by metal-hammer results in fresh install with wiped disks #207

Open mreiger opened 3 years ago

mreiger commented 3 years ago

Apparently it can sometimes happen that an attempt from metal-hammer to report a successful reinstall of a machine fails with an error code 500; in one such instance the machine rebooted and when the metal-hammer next contacted the api was told to do a fresh install, which wiped all local disks.

This is obviously undesired behaviour; how can this issue be fixed?

Error message from metal-hammer ipmi console log when failing to report:

POST /machine/report/eff49e00-6ff4-11e9-8000-efbeaddeefbe HTTP/1.1
Host: 10.255.255.5:4242
User-Agent: Go-http-client/1.1
Content-Length: 416
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip

{"bootloaderid":"metal-debian","cmdline":"console=ttyS1,115200n8 root=UUID=3cb7a6b6-65e5-4ff7-8dfd-e00bb5abcf1b init=/bin/systemd net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 systemd.unified_cgroup_hierarchy=0","console_password":"hbPMo74i1dSQXv/F","initrd":"/tmp/initrd.img-5.10.0-6-amd64","kernel":"/tmp/vmlinuz-5.10.0-6-amd64","message":null,"os_partition":null,"primary_disk":null,"success":true}

DBUG[08-04|12:48:46] lldp                                     detectedNeighbor="Name:nbg-w8101-r01leaf01 Desc:Cumulus Linux version 3.7.13 running on Accton AS7712-32X Chassis:Mac:80:a2:35:e1:39:9c Port:Mac:80:a2:35:e1:39:a0" caller=lldpclient.go:71
DBUG[08-04|12:48:46] lldp                                     detectedNeighbor="Name:nbg-w8101-r01leaf02 Desc:Cumulus Linux version 3.7.13 running on Accton AS7712-32X Chassis:Mac:80:a2:35:32:9d:bb Port:Mac:80:a2:35:32:9d:bf" caller=lldpclient.go:71
HTTP/1.1 500 Internal Server Error
Content-Length: 0
Date: Wed, 04 Aug 2021 12:48:48 GMT

EROR[08-04|12:48:48] report                                   error="[POST /machine/report/{id}][500] reportInternalServerError " caller=report.go:42
EROR[08-04|12:48:48] reinstall failed                         error="unable to report image installation: [POST /machine/report/{id}][500] reportInternalServerError " caller=root.go:140
EROR[08-04|12:48:48] reinstall cancelled => boot into existing OS... reason="unable to report image installation: [POST /machine/report/{id}][500] reportInternalServerError " caller=reinstall.go:25
POST /machine/abort-reinstall/eff49e00-6ff4-11e9-8000-efbeaddeefbe HTTP/1.1
Host: 10.255.255.5:4242
User-Agent: Go-http-client/1.1
Content-Length: 28
Accept: application/json
Content-Type: application/json
Accept-Encoding: gzip

{"primary_disk_wiped":true}

HTTP/1.1 500 Internal Server Error
Content-Length: 27
Content-Type: application/json

Complete log files of failed reinstall followed by fresh install, and for comparison a successful reinstall, are attached.

successful-reinstall.txt failed-reinstall-and-subsequent-install-with-wipe.txt

Gerrit91 commented 3 years ago

Here are related metal-api error logs:

{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","status":403,"content-length":55,"duration":0.000260831}
{"level":"error","time":"2021-08-04T12:48:48.296Z","caller":"rest/middleware.go:100","message":"cannot get user from request","rqid":"5019de598aab8c4f96ea782160836959","remoteaddr":"10.20.4.3","method":"GET","uri":"/metal/v1/machine/abort-reinstall","route":"/metal/v1/machine/{id}","error":"Wrong HMAC found","got":"667c82a5692d7e1013a88b022af0f367e3d349860818ff8714b685bed3ae91e3","want":"4a2e464619017e291fd0113c5a884b89d825a5e8ea8fbd6b1a83eb5f1eab86c8"}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"rest/middleware.go:88","message":"Rest Call","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"metal-edit@metal-stack.io","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","status":422,"content-length":137,"duration":0.087437639}
{"level":"error","time":"2021-08-04T12:48:48.274Z","caller":"service/service.go:96","message":"service error","rqid":"bc0c64ff4f0ae73854a35da29d30d735","remoteaddr":"10.20.4.3","method":"POST","uri":"/metal/v1/machine/eff49e00-6ff4-11e9-8000-efbeaddeefbe/finalize-allocation","route":"/metal/v1/machine/{id}/finalize-allocation","useremail":"metal-edit@metal-stack.io","operation":"finalizeAllocation","status":422,"error":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731","service-caller":"machine-service.go:1635","resp":"the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}

And these are from metal-core:

2021-08-04T12:48:48.301245+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011endpoint/abortReinstall.go:38#011Failed to abort reinstall#011{"app": "metal-core", "statusCode": 500, "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "boot information": null}
2021-08-04T12:48:48.299985+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.298Z#011error#011api/abortReinstall.go:22#011Failed to abort reinstall#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "primary disk already wiped": true, "error": "[POST /v1/machine/{id}/abort-reinstall][403] abortReinstallMachine default  Wrong HMAC found (403)"}
2021-08-04T12:48:48.279567+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.276Z#011error#011endpoint/report.go:57#011Unable to report machine back to api.#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default  the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
2021-08-04T12:48:48.278363+00:00 nbg-w8101-r02leaf01 docker[16471]: 2021-08-04T12:48:48.275Z#011error#011api/finalizeAllocation.go:27#011Finalize failed#011{"app": "metal-core", "machineID": "eff49e00-6ff4-11e9-8000-efbeaddeefbe", "error": "[POST /v1/machine/{id}/finalize-allocation][422] finalizeAllocation default  the machine \"eff49e00-6ff4-11e9-8000-efbeaddeefbe\" could not be enslaved into the vrf vrf731 (422)"}
Gerrit91 commented 3 years ago

I tried to reconstruct what happened, here is what found:

  1. Everything was according to plan, wiped primary disk and reinstalled OS
  2. Then metal-hammer calls metal-core /report, returning an error
    1. The metal-core sets the boot order to HD (but it did not work?)
    2. metal-core calls the metal-api on /finalize-allocation
      • The metal-api sets the reinstall field in the machine to false
      • This function then fails when updating the switch entity (probably "was changed by another", original error message was swallowed)
  3. The metal-hammer calls /abort-reinstall, returning an error as well
    • This time the request was declided immediately from the metal-api because HMAC was incorrect (can be caused by system time lags?)
  4. When this fails, the metal-hammer runs kexec.Reboot, restarting the machine
  5. The metal-hammer reboots, now finds the reinstall flag on false, thinking it's a usual released machine...

This certainly is a cluster of very unhappy events.

My question would be: Why does the metal-hammer wipe disks when there is an allocation on the machine? I think, this should be the first thing to prevent before starting to wipe disks. Maybe this condition is to specific: https://github.com/metal-stack/metal-hammer/blob/v0.9.1/cmd/root.go#L129.

I'll investigate a little more to see what we can do.