teemtee / tmt

Test Management Tool
MIT License

Teach `tmt` to reboot unresponsive machines #1523

Open rh-mcermak opened 1 year ago

rh-mcermak commented 1 year ago

For aggressive testing, we need a way to reboot an unresponsive test box, so that the testing can resume after reboot. At the moment this functionality is in beaker-jobwatch (--reboot). I need to replicate this in TMT.

happz commented 1 year ago

A few notes: this probably belongs to the same area in which tmt runs a background watchdog for a guest (if enabled) to watch for kernel panics and similar issues. It would need tweakable timeouts and a way to enable or disable it, and it may not be applicable to every provision plugin out there (local might be moot...)

rh-mcermak commented 1 year ago

@psss could we prioritize this, please? Wdyt?

psss commented 1 year ago

Sorry, this slipped through the cracks. Not sure what a realistic timeframe for this could be, as there is already so much on the plate, but I've added it to the next hacking session for discussion.

rh-mcermak commented 1 year ago

Prioritizing this would be very much appreciated.

psss commented 11 months ago

@happz planning to look into this for the next release.

rh-mcermak commented 11 months ago

Sounds great! Thank you, guys.

thrix commented 10 months ago

So this will be implemented on top of checks; users will have to explicitly ask for this feature.

happz commented 10 months ago

So, kicking things off in https://github.com/teemtee/tmt/pull/2412. There's still work needed, probably will not be ready for 1.29, but the next release, 1.30 at the beginning of December, sounds doable to me.

For now, a thread is spawned which pings the guest running a test. There's logging and a few controls, but no reboots.

happz commented 10 months ago

@rh-mcermak would you mind sharing your current or common setup for this feature? I.e. how often do your tools ping the box, how many packets, and how long it takes for the check to decide to reboot the box, in terms of ping attempts and time? Are there any caveats, any exceptions, e.g. situations in which there's no response for allocated time and yet the machine is not expected to be rebooted?

I have very simple setups, suitable for tests, and I'd like to have at least one real-life use case, to set some reasonable defaults, and test the use case and a few corner cases. To learn how you use this type of feature.

rh-mcermak commented 10 months ago

Hello sir, thanks a ton for looking into it! I'm giving it a one packet ping attempt every minute, and I keep track of last_successful_ping for every test box. If, at any time, the last_successful_ping is older than an hour, reboot. Something like this :)
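
The policy described above could be sketched roughly as follows. This is only an illustration, not tmt or beaker-jobwatch code: `PingWatchdog`, `ping_once`, and the injectable `clock` are hypothetical names, and the `ping` flags assume Linux iputils.

```python
import subprocess
import time
from dataclasses import dataclass, field
from typing import Callable, Dict


def ping_once(host: str) -> bool:
    """Send a single ICMP echo request (assumes Linux iputils `ping`)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "5", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


@dataclass
class PingWatchdog:
    """Hypothetical helper: track the last successful ping of each test box."""

    ping: Callable[[str], bool] = ping_once
    threshold: float = 3600.0  # reboot once the last success is older than this
    clock: Callable[[], float] = time.monotonic
    last_successful_ping: Dict[str, float] = field(default_factory=dict)

    def check(self, host: str) -> bool:
        """Ping once; return True if the host should be rebooted."""
        now = self.clock()
        # A host seen for the first time starts its timer now.
        self.last_successful_ping.setdefault(host, now)
        if self.ping(host):
            self.last_successful_ping[host] = now
            return False
        return now - self.last_successful_ping[host] > self.threshold
```

A caller would invoke `check(host)` once a minute for every test box and trigger a reboot whenever it returns `True`.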

happz commented 9 months ago

Update: proposed #2469. The watchdog runs in its own thread, while another thread runs the test (e.g. SSH, which then runs the test.sh wrapper on a guest). The watchdog can detect that the guest is not responding to pings and/or SSH connections, and may issue a reboot. This action needs to be communicated to the thread managing the test: either that the reboot has been issued, or that the reboot is needed. However, there is no place through which these two threads could exchange information, no single data structure shareable by both parties.

The code passes a (plugin, test, guest) tuple between methods, and it makes no sense to add a fourth item. On the other hand, if we bundle these three into a single structure, we can then dedicate a key in this structure to hold the "reboot required" flag, for example. The flag would be mostly ignored and would remain untouched, but the watchdog check would use it to notify the test thread that it's time to act.
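
A minimal sketch of the bundling idea, under stated assumptions: `Invocation`, `watchdog`, `run_test`, and the `guest_is_alive` callback are hypothetical names for illustration and are not taken from the patch. A `threading.Event` plays the role of the "reboot required" flag.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Invocation:
    """Hypothetical bundle replacing the (plugin, test, guest) tuple."""

    plugin: Any
    test: Any
    guest: Any
    # The dedicated "reboot required" flag, safe to share between threads.
    reboot_required: threading.Event = field(default_factory=threading.Event)


def watchdog(
    invocation: Invocation,
    guest_is_alive: Callable[[Any], bool],
    stop: threading.Event,
    period: float = 0.01,
) -> None:
    """Probe the guest periodically; raise the flag once it stops responding."""
    while not stop.wait(period):
        if not guest_is_alive(invocation.guest):
            invocation.reboot_required.set()
            return


def run_test(invocation: Invocation, timeout: float) -> str:
    """Stand-in for the test thread: wait for the test or for the flag."""
    # A real implementation would poll the SSH process as well; this sketch
    # only shows how the test thread learns about the watchdog's decision.
    if invocation.reboot_required.wait(timeout):
        return "reboot"
    return "finished"
```

Most code paths would never look at `reboot_required`; only the watchdog sets it and only the test thread reacts to it.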

happz commented 9 months ago

I'm going to file a couple of issues to track what's needed to support hard reboot and/or reboot without a working SSH connection. It's the missing piece I have not touched yet, but starting with #2512.

happz commented 9 months ago

Bumping to 1.31. It's not forgotten, it's not dead, it's still moving but there is too much work.

happz commented 7 months ago

Bumping to 1.32. Necessary components are being merged, but I'm pretty sure there would be no time left to review the main patch properly before the 1.31 deadline :/

rh-mcermak commented 7 months ago

I've been testing commit b019d65895c38b51856fae3c291c9a55327e5310 and it seems to test fine. One important factor affecting how reboot works after a kernel crash is the kdump service. By default, it reboots a system after a kernel panic. To let TMT reboot the machine on its own, the kdump service needs to be disabled (systemctl stop kdump). The ppc64le systems are specific in that they don't use kdump. Instead they use fadump. Fadump is not a service that could be stopped. Instead it runs in the firmware of the test system. To suppress a reboot driven by kdump, --kernel-options='panic=0' --kernel-options-post='panic=0' can be used on the beaker client command line. However, what happens next is that the system panics, but it still responds to pings (!). To detect such a state, ping isn't a sufficient detection method. An attempt to ssh in, for instance, seems to work better.
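
One cheap way to approximate "an attempt to ssh in" without credentials is to open a TCP connection to the SSH port and wait for the server's identification string, which only a working userspace sshd can send. This is a rough sketch, not how tmt does it: `ssh_responds` is a hypothetical helper, and it assumes the server sends its `SSH-` version line first, as OpenSSH does.

```python
import socket


def ssh_responds(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if an SSH server on host sends its identification string."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            banner = conn.recv(64)
    except OSError:
        # Refused, unreachable, or timed out: treat all of them as "not alive".
        return False
    # Assumption: the version line comes first (e.g. b"SSH-2.0-OpenSSH_...").
    return banner.startswith(b"SSH-")
```

A watchdog could treat the guest as alive only when both the ping and this probe succeed.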

Another thing that comes to mind is that a system can simply fail to reboot. A typical problem is that after a reboot, a system boots into a different OS than the one it was running before.

Another problem might be that the user/owner changes after the reboot.

Not saying that TMT should handle all this, just brainstorming ;)

rh-mcermak commented 7 months ago

Thanks a ton for implementing the feature! I'd love to start moving to it :)