monogon-dev / monogon

The Monogon Monorepo. May contain traces of peanuts and a ✨pure Go Linux userland✨. Work in progress!
https://monogon.tech
Apache License 2.0

Equinix Reboot request is not synchronous with the node #215

Open fionera opened 1 year ago

fionera commented 1 year ago

The Equinix API does not synchronously reboot machines. Instead, it schedules a reboot which happens... at some point.

This causes us issues when we want to reboot and then immediately run something against a machine, because we can race the reboot: whatever we start then suddenly gets killed when the machine actually goes down.

We worked around this by hardcoding a sleep after the reboot call, but we really should have a way to check the status of a reboot from Equinix instead.
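
For context, the workaround currently looks roughly like this - a sketch rather than our actual code, with the endpoint and auth header taken from the public Equinix Metal API docs, and the device ID and sleep duration as placeholders:

```go
// Sketch of the current workaround: schedule a reboot through the Equinix
// Metal API, then sleep a hardcoded amount of time and hope the BMC has
// acted by then. Endpoint and header per the public API docs; device ID,
// token handling and the sleep duration are placeholders.
package main

import (
	"fmt"
	"net/http"
	"os"
	"strings"
	"time"
)

func rebootDevice(apiToken, deviceID string) error {
	// POST /metal/v1/devices/{id}/actions with {"type": "reboot"} only
	// *schedules* a reboot; the call returns long before the machine
	// actually goes down.
	url := fmt.Sprintf("https://api.equinix.com/metal/v1/devices/%s/actions", deviceID)
	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(`{"type": "reboot"}`))
	if err != nil {
		return err
	}
	req.Header.Set("X-Auth-Token", apiToken)
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("reboot request failed: %s", resp.Status)
	}
	return nil
}

func main() {
	if err := rebootDevice(os.Getenv("EQUINIX_API_TOKEN"), "device-uuid-here"); err != nil {
		panic(err)
	}
	// Hardcoded sleep: there is no signal for "reboot actually started",
	// so we just wait and hope we don't race the BMC.
	time.Sleep(5 * time.Minute)
	// ...continue working against the machine here.
}
```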

vielmetti commented 1 year ago

The reboot docs are here:

https://deploy.equinix.com/developers/docs/metal/server-metadata/server-actions/#reboot

specifically this:

> While rebooting the server will be unavailable, but should come back after a few minutes, or however long the reboot process takes. Reboot actions initiated from the console, CLI, and API are logged and listed on the server's Timeline tab.

The "Timeline tab" is also used by the boot process, where it's called "events", and you can post to it yourself as described in this doc:

https://deploy.equinix.com/developers/docs/metal/server-metadata/user-state/

Which is all a way of saying that I can think of a couple of ways to do this, but they might require some code on the node that emits a notice to the Equinix API, which your automation can then query.
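
For example, something on the node along these lines could post an event once your software considers itself ready - a rough sketch only; the actual endpoint and payload schema are defined in the user-state doc above, so they're left as flags here rather than guessed at:

```go
// Rough sketch of a node-side "it's really ready" notification. The exact
// user-state endpoint and payload schema come from the Equinix doc linked
// above and are deliberately not hardcoded here; the JSON fields below are
// illustrative, not the documented schema.
package main

import (
	"flag"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	endpoint := flag.String("endpoint", "", "user-state endpoint from the Equinix Metal docs")
	message := flag.String("message", "node is up and serving", "event message to post")
	flag.Parse()

	body := fmt.Sprintf(`{"state": "running", "message": %q}`, *message)
	resp, err := http.Post(*endpoint, "application/json", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("posted user state event:", resp.Status)
}
```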

leoluk commented 1 year ago

@vielmetti Thanks for jumping in!

To explain the problem a little better - we are doing large-scale automation of Equinix deployments, and sometimes, deployments fail for various reasons or a machine stops reporting after deployment. In these cases, we reboot the machine via the Equinix API.

However, the reboot is done asynchronously, so we don't know for sure whether the reboot has actually happened:

The events entry is added at the time of the API call, but as far as we can tell, the BMC reboot request happens later (sometimes a lot later). If something goes wrong, we don't know whether we just need to wait a little longer, or go into recovery.
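
To illustrate, this is roughly all our automation can observe today (a sketch; endpoint per the public API docs, field names as we understand them) - the reboot event is already in the timeline the moment the API call returns, so polling it tells us nothing about whether the BMC has acted:

```go
// Sketch of polling a device's event timeline via the public Equinix Metal
// API (endpoint per the API docs; the device ID is a placeholder and the
// field names reflect our understanding of the response). The "reboot" event
// shows up as soon as the request is made, so this alone cannot distinguish
// "reboot pending" from "reboot happened".
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	deviceID := "device-uuid-here"
	url := fmt.Sprintf("https://api.equinix.com/metal/v1/devices/%s/events", deviceID)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("X-Auth-Token", os.Getenv("EQUINIX_API_TOKEN"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Events []struct {
			Type      string `json:"type"`
			Body      string `json:"body"`
			CreatedAt string `json:"created_at"`
		} `json:"events"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	for _, e := range out.Events {
		// We can see that a reboot was *requested*, but not that it started.
		fmt.Printf("%s %s: %s\n", e.CreatedAt, e.Type, e.Body)
	}
}
```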

vielmetti commented 1 year ago

Thanks @leoluk (and I have an email coming to you separately on another topic, look for that).

Is your system able to emit a message to our logs when it's "back up"? And I'll point out here that "up" is really your call to decide - a system with a working network interface might not have all of the other functions up and running to the state where it can handle requests. You'd then use the "user state" method described above to send out the "it's really ready" message.

Separately, though, it appears there's a piece of our automation that could be improved - after the "reboot requested" message you'd like to see a "reboot started" message, to reduce the uncertainty about the timing of the whole process.

An ugly but perhaps useful additional approach would be to connect to the user console over SOS ("serial over SSH") during reboot to debug anything that is weird:

https://deploy.equinix.com/developers/docs/metal/resilience-recovery/serial-over-ssh/

but at scale that might be awkward/impractical.

vielmetti commented 1 year ago

@leoluk the other option, depending on the "why" of the reboot, is to use kexec() to do a warm reboot and pivot. This avoids a lot of hardware-induced delays but also might not initialize everything you hope for.
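
For completeness, a warm reboot via kexec from userspace looks roughly like this - a sketch using the kexec_file_load syscall through golang.org/x/sys/unix, with the kernel/initrd paths and command line as placeholders:

```go
// Rough sketch of a warm reboot via kexec_file_load, using
// golang.org/x/sys/unix. Kernel/initrd paths and the command line are
// placeholders; error handling is minimal.
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	kernel, err := os.Open("/boot/vmlinuz")
	if err != nil {
		panic(err)
	}
	initrd, err := os.Open("/boot/initrd.img")
	if err != nil {
		panic(err)
	}

	// Stage the new kernel and initrd. This does not reboot yet.
	cmdline := "console=ttyS1,115200n8 ro"
	if err := unix.KexecFileLoad(int(kernel.Fd()), int(initrd.Fd()), cmdline, 0); err != nil {
		panic(err)
	}

	// Jump straight into the staged kernel, skipping firmware/POST - which
	// is why this avoids the hardware-induced delays, but also skips the
	// device reinitialization a full power cycle would do.
	if err := unix.Reboot(unix.LINUX_REBOOT_CMD_KEXEC); err != nil {
		panic(err)
	}
}
```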

leoluk commented 1 year ago

> the other option, depending on the "why" of the reboot, is to use kexec() to do a warm reboot and pivot. This avoids a lot of hardware-induced delays but also might not initialize everything you hope for.

That's how we launch Monogon OS :-) We first deploy your stock Ubuntu image, then kexec into our first-stage loader, install the OS, and then kexec again. It works great. But sometimes, machines crash or the initial deployment fails, and then we have to request a hard reboot via the BMC. If all goes well there are no reboots at all - by the time we reboot, we're already in the recovery path.

> Is your system able to emit a message to our logs when it's "back up"? And I'll point out here that "up" is really your call to decide - a system with a working network interface might not have all of the other functions up and running to the state where it can handle requests. You'd then use the "user state" method described above to send out the "it's really ready" message.

Yep, once the machine is back up, it'll "call home" to our backend (i.e. the green dot on the diagram above) and we'll know it succeeded. We could also post it to the event log if that's helpful for debugging on your end.

The tricky case is when it doesn't, or takes longer than expected, and we don't know in what state it is. Having a "reboot started" message would be great and would solve the problem!

> An ugly but perhaps useful additional approach would be to connect to the user console over SOS ("serial over SSH") during reboot to debug anything that is weird

We're actually planning to auto-connect to every SOS and write the logs to a database to debug failed boots or crashes (https://github.com/monogon-dev/monogon/issues/200). But parsing it would be tricky - much of the boot process uses terminal escape sequences, different vendors do things differently, we have to be fast enough to reconnect, etc. A "reboot started" message would be much nicer.
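
Roughly what we have in mind for that (a sketch only; the hostname scheme and device-UUID-as-username follow the SOS docs linked above, everything else is a placeholder):

```go
// Sketch of what #200 could look like: connect to a device's SOS console
// over SSH and stream whatever the console prints. Hostname scheme and
// device-UUID-as-username follow the SOS docs; key path, facility code and
// device ID are placeholders.
package main

import (
	"fmt"
	"io"
	"os"

	"golang.org/x/crypto/ssh"
)

func main() {
	deviceUUID := "device-uuid-here"
	facility := "facility-code-here"

	key, err := os.ReadFile(os.Getenv("HOME") + "/.ssh/id_ed25519")
	if err != nil {
		panic(err)
	}
	signer, err := ssh.ParsePrivateKey(key)
	if err != nil {
		panic(err)
	}

	cfg := &ssh.ClientConfig{
		User:            deviceUUID, // SOS uses the device UUID as the SSH user
		Auth:            []ssh.AuthMethod{ssh.PublicKeys(signer)},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // sketch only; pin host keys in real code
	}
	addr := fmt.Sprintf("sos.%s.platformequinix.com:22", facility)
	client, err := ssh.Dial("tcp", addr, cfg)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	sess, err := client.NewSession()
	if err != nil {
		panic(err)
	}
	defer sess.Close()

	// The console is interactive, so request a PTY and copy the raw output
	// (escape sequences and all) to stdout - a database write would go here.
	out, err := sess.StdoutPipe()
	if err != nil {
		panic(err)
	}
	if err := sess.RequestPty("xterm", 40, 120, ssh.TerminalModes{}); err != nil {
		panic(err)
	}
	if err := sess.Shell(); err != nil {
		panic(err)
	}
	io.Copy(os.Stdout, out)
}
```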

vielmetti commented 1 year ago

Hm. If you're not using reserved instances, then it might be easier/faster/better to just fail hard on any machine that would otherwise require a hard reboot, and provision a brand-new machine instead. Instead of "reboot", just "destroy". It all depends on how long the systems stay alive, though - deploying new incurs a new first-hour charge, whereas a hard reboot doesn't have that cost. For some classes of systems it will be way faster to boot fresh than to reboot.

leoluk commented 1 year ago

We're dealing with lots of reserved instances, and Equinix's current deployment process appears to have some probabilistic failure modes where retrying does help (there are open tickets for the ones we're aware of). So we generally try to shake out any flakes first, and only if that doesn't work do we open a ticket with Equinix to fix the host.

vielmetti commented 1 year ago

(Additional info requested privately.)