teemtee / tmt

Test Management Tool
MIT License
80 stars 121 forks source link

cannot tmt-reboot from an independent session #2067

Closed kkaarreell closed 10 months ago

kkaarreell commented 1 year ago

While running a multi-host test I had a need to issue a reboot command through ssh session (an "operator" rebooting a worker). However, tmt-reboot didn't work properly in this case and no reboot was made as the tmt-reboot script relies on various environment variables in order to store the right data and these environment variables are not available in a custom ssh session.

This is what I see in terminal for the tmt process, variables like TMT_REBOOT_REQUEST are probably needed for proper functioning.

root        1333    1317  0 15:27 pts/1    00:00:00 bash -c export SERVERS='192.168.122.207 192.168.122.251 192.168.122.95 192.168.122.211'; export TMT_ROLE_OPERATOR=192.168.122.207; export TMT_ROLE_VERIFIER=192.168.122.251; export TMT_ROLE_REGISTRAR=192.168.122.95; export TMT_ROLE_AGENT=192.168.122.211; export TMT_GUEST_ROLE=AGENT; export TMT_GUEST_HOSTNAME=192.168.122.211; export TMT_PLAN_DATA=/var/tmp/tmt/run-057/plans/documentation/data; export TMT_TREE=/var/tmp/tmt/run-057/plans/documentation/tree; export TMT_TEST_NAME=/documentation/ensuring-system-integrity-with-keylime; export TMT_TEST_DATA=/var/tmp/tmt/run-057/plans/documentation/execute/data/guest/agent/documentation/ensuring-system-integrity-with-keylime-1/data; export TMT_TEST_METADATA=/var/tmp/tmt/run-057/plans/documentation/execute/data/guest/agent/documentation/ensuring-system-integrity-with-keylime-1/metadata.yaml; export TMT_REBOOT_REQUEST=/var/tmp/tmt/run-057/plans/documentation/execute/data/guest/agent/documentation/ensuring-system-integrity-with-keylime-1/data/reboot-request; export TMT_REBOOT_COUNT=0; export REBOOTCOUNT=0; export RSTRNT_REBOOTCOUNT=0; export BEAKERLIB_DIR=/var/tmp/tmt/run-057/plans/documentation/execute/data/guest/agent/documentation/ensuring-system-integrity-with-keylime-1; export BEAKERLIB_COMMAND_SUBMIT_LOG='bash /usr/local/bin/tmt-file-submit'; cd /var/tmp/tmt/run-057/plans/documentation/discover/ensuring-system-integrity-with-keylime/tests/documentation/ensuring-system-integrity-with-keylime; set -eo pipefail; ./tmt-test-wrapper.sh </dev/null |& ca
lukaszachy commented 1 year ago

Just a note: The naive 'replace TMT_REBOOT_REQUEST' in tmt-reboot script with real path and make it call 'reboot' didn't help. What I did:

tmt-reboot command got changed into (just the end of it has changed)

[default-1 (agent)]                    out: printf "$json_fmt" "$command" "$timeout" > "/var/tmp/tmt/FOO/plan/execute/data/guest/default-1/test-1/data/reboot-request"
[default-1 (agent)]                    out: reboot
[default-1 (agent)]                    out: kill $PPID

However when it was executed:

default-0 (operator)]                     Connection to 192.168.122.72 closed by remote host.
[default-1 (agent)]            Command returned '255'.
        Append to file '/var/tmp/tmt/FOO/plan/execute/data/guest/default-1/test-1/output.txt'.
[default-1 (agent)]            Copy '/var/tmp/tmt/FOO/plan/execute/data/guest/default-1/test-1' from the guest to '/'.
[default-1 (agent)]            Run command: rsync -s -R -r -z --links --safe-links --protect-args --exclude '/var/tmp/tmt/FOO/plan/execute/data/guest/default-1/test-1/backup*' -e 'ssh -oForwardX11=no -oStrictHostKeyChecking=no -oUserKnownHostsFile=/dev/null -oServerAliveInterval=60 -oServerAliveCountMax=5 -oIdentitiesOnly=yes -p22 -i /var/tmp/tmt/FOO/plan/provision/default-1/id_ecdsa -S/run/user/17062/tmt/tmp2709csg5' root@192.168.122.72:/var/tmp/tmt/FOO/plan/execute/data/guest/default-1/test-1 /
[default-1 (agent)]            err: Warning: Permanently added '192.168.122.72' (ED25519) to the list of known hosts.
[default-1 (agent)]            err: Connection closed by 192.168.122.72 port 22
[default-1 (agent)]            err: rsync: connection unexpectedly closed (0 bytes received so far) [Receiver]
[default-1 (agent)]            err: rsync error: unexplained error (code 255) at io.c(231) [Receiver=3.2.7]
[default-1 (agent)]            Command returned '255'.
[default-1 (agent)]            Ensure that rsync is installed on the guest.
.... more attempts to install rsync
[default-1 (agent)]            fail: Failed to pull workdir from the guest. This usually means that login as 'root' to the guest does not work

execute step failed
lukaszachy commented 1 year ago

Problem is that tmt-reboot has to create the file and exit the process tmt is running on that guest (as test). After that tmt needs to pull data (so guest has to be up and sshd has to be running) and tmt only then check whether TMT_REBOOT_REQUEST file exists. If it does, tmt reboots the guest.

Currently we have not channel to communicate with tmt process on the host :/

kkaarreell commented 1 year ago

So it really sounds that we should have something like a pid file or envvar file at a given location telling which variables should be used and which process should be terminated.

lukaszachy commented 1 year ago

Another use case: Calling 'tmt-reboot' happens in child process (e.g. during pytest testcase).

thrix commented 1 year ago

Another use case, testing-farm reserve command needs also tmt-reboot to work :)

thrix commented 1 year ago

@lukaszachy @kkaarreell so is there some advised solution?

that pidfile / envvar file is something we could have.

As this is blocking reboots for the new reserve workflow ... maybe I would look at it rather sooner

happz commented 1 year ago

Everything I've been able to come up with is prone to race conditions. Well, at least the easy solutions. We need a place where tmt can reliably store a PID of the test process (tree), but that can be arranged, with file locking it can be made safe. Reading the file, its modifications and removal can be protected, and serialized, and tmt-reboot would see a consistent content of the PID file.

Yet it still may read the file and try to terminate the test while the test finished between these two actions, but before tmt was able to acquire the lock and update the test pid file... And, the test pid is unknown until it actually starts, so tmt needs to start the test, then take a step aside, update the PID file, and hope the test did not complete while it was doing this :)

As an asynchronous event, it should be advised against using tmt-reboot from outside the test itself, unless the user knows what they are doing and prepared the test to survive such interruption. The reboot signal might come in the middle of rlAssert, for example. Extra care of synchronization needs to be taken, but I guess that's users' responsibility.

kkaarreell commented 12 months ago

Is tmt running the actual test on a test system directly or using a wrapper script? I though there is some wrapper which could store that PID too.

EDIT: Yes, there is

# ps -ef | grep tmt
root        4901    4232  0 03:47 pts/1    00:00:00 bash -c export TMT_PLAN_DATA=/var/tmp/tmt/run-097/plans/default/data; export TMT_TREE=/var/tmp/tmt/run-097/plans/default/tree; export TMT_TEST_NAME=/Library/sync; export TMT_TEST_DATA=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/data; export TMT_TEST_SERIAL_NUMBER=1; export TMT_TEST_METADATA=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/metadata.yaml; export TMT_REBOOT_REQUEST=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/data/reboot-request; export TMT_REBOOT_COUNT=0; export REBOOTCOUNT=0; export RSTRNT_REBOOTCOUNT=0; export BEAKERLIB_DIR=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1; export BEAKERLIB_COMMAND_SUBMIT_LOG='bash /usr/local/bin/tmt-file-submit'; export TMT_TOPOLOGY_YAML=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/tmt-test-topology.yaml; export TMT_TOPOLOGY_BASH=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/tmt-test-topology.sh; cd /var/tmp/tmt/run-097/plans/default/discover/default-0/tests/Library/sync; set -eo pipefail; ./tmt-test-wrapper.sh.default-0 </dev/null |& cat

root        4920    4901  0 03:47 pts/1    00:00:00 bash -c export TMT_PLAN_DATA=/var/tmp/tmt/run-097/plans/default/data; export TMT_TREE=/var/tmp/tmt/run-097/plans/default/tree; export TMT_TEST_NAME=/Library/sync; export TMT_TEST_DATA=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/data; export TMT_TEST_SERIAL_NUMBER=1; export TMT_TEST_METADATA=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/metadata.yaml; export TMT_REBOOT_REQUEST=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/data/reboot-request; export TMT_REBOOT_COUNT=0; export REBOOTCOUNT=0; export RSTRNT_REBOOTCOUNT=0; export BEAKERLIB_DIR=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1; export BEAKERLIB_COMMAND_SUBMIT_LOG='bash /usr/local/bin/tmt-file-submit'; export TMT_TOPOLOGY_YAML=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/tmt-test-topology.yaml; export TMT_TOPOLOGY_BASH=/var/tmp/tmt/run-097/plans/default/execute/data/guest/default-0/Library/sync-1/tmt-test-topology.sh; cd /var/tmp/tmt/run-097/plans/default/discover/default-0/tests/Library/sync; set -eo pipefail; ./tmt-test-wrapper.sh.default-0 </dev/null |& cat

That /tmt-test-wrapper.sh.default-0 is what I am speaking about, currently it contains just

./runtest.sh

but before running it it could store PID or PPID (not sure now what in fact we need) to a required location.

happz commented 12 months ago

Is tmt running the actual test on a test system directly or using a wrapper script? I though there is some wrapper which could store that PID too.

Yes, there's a wrapper script, https://github.com/teemtee/tmt/blob/main/tmt/steps/execute/internal.py#L31. I'm trying to update the wraper and tmt-reboot to play safely together.

happz commented 12 months ago

Well, here's the first attempt, https://github.com/teemtee/tmt/pull/2311.

To use tmt-reboot from another session while a test is running, this should work:

$ export TMT_TEST_PIDFILE=/var/tmp/tmt/tmt-test.pid
$ export TMT_TEST_PIDFILE_LOCK=/var/tmp/tmt/tmt-test.pid.lock
$ tmt-reboot
kkaarreell commented 12 months ago

Can you use these values as defaults so that users dont have to define them? Actually, why would I use different location?

happz commented 12 months ago

Can you use these values as defaults so that users dont have to define them?

Good point, I will do that. There will be rough edges like this one, it’s still a draft. Point out more issues, please.

Actually, why would I use different location?

I don’t know, and I don’t want to care :) I for one am not using /var/tmp/tmt at all, therefore I want to keep these configurable.

happz commented 12 months ago

Can you use these values as defaults so that users dont have to define them?

Done. Also improved logging, and fixed error handling.

There's a test for this functionality, but TBH, I'm not sure we're close to providing the needed fix. I'm worried about loose ends here and corner cases there. Testing & reviews are needed & appreciated.