Open The-Mule opened 3 weeks ago
tmt uses an SSH session to watch the SUT (the guest machine running the test). When anything happens to this session, the run is aborted completely. It should be acceptable for a test to break the connection temporarily (e.g. a test might affect networking for a short time, or it might mess with libraries that ssh relies on, etc.). I would like to propose:
- Make the SSH session more robust so it can handle these situations: when the session breaks, attempt to reconnect and resume the test
Please check https://github.com/teemtee/tmt/issues/2696, it seems to be related to your situation. If it does not help, it would be very useful if you could share with us what else would be needed for your case.
(it is very likely that all disruptive test actions will be completed and reverted at that point).
Yeah, in your case, when you and your test are causing the changes on purpose. But telling the difference between "expected" and "the lab in the US is burning down" is the hard part, I would very much dispute the "likely" bit :)
- In rare situations when the SSH session cannot be resumed even after several attempts, don't abort the run. Attempt to salvage the results of the tests preceding the problem and run the remaining ones from scratch. Or perhaps disable the problematic test and create a new run without it.
Re-running "from scratch" would be a possible solution, with or without dropping the test: restarting the plan, including provisioning a brand new guest to avoid running tests in a tainted environment. That is well beyond what tmt can do now, though.
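To make the "several attempts" part of the proposal concrete, here is a minimal sketch of a bounded reconnect loop. It is not how tmt handles its connection; `retry` is a hypothetical helper, and the host alias `guest` in the commented usage line is an assumption. The point is only that a fixed attempt budget is one crude way to separate "the test broke the link for a moment" from "the guest is gone for good".

```shell
#!/bin/bash
# Hypothetical helper, not part of tmt: run a probe command up to
# <attempts> times, sleeping <delay> seconds between failed attempts.
# Returns 0 as soon as the probe succeeds, 1 if every attempt failed.
retry() {
    local attempts=$1 delay=$2
    shift 2
    local i
    for ((i = 1; i <= attempts; i++)); do
        if "$@"; then
            return 0
        fi
        # No point in sleeping after the final failed attempt.
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Assumed usage against a guest reachable as "guest":
# retry 5 10 ssh -o BatchMode=yes -o ConnectTimeout=5 guest true
```

Once the budget is exhausted the caller still has to decide what to do next (salvage results, re-provision, give up), which is the hard part discussed above.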
IMO #2696 solves (2) for me.
Ad (1): I am starting to realize that this might not be possible. What I am looking for is a way to "resurrect" a closed SSH session. SSH does not support this by design, unless some other tool underneath the connection handles it (e.g. something like screen or tmux), so that once SSH reconnects it can continue where it left off. To be more specific, I have the following plan:
❯ cat crasher.fmf
prepare:
  - name: Enable FIPS mode
    how: ansible
    order: 99
    playbook:
      - /playbooks/enable-fips.yaml
discover:
  - how: shell
    tests:
      - name: crasher
        test: |
          set -x
          # Backup.
          cp /usr/lib64/ossl-modules/fips.so .
          # Configure (break openssl).
          dd if=/dev/zero of=fips_hmac bs=8 count=1 conv=notrunc
          objcopy --update-section .rodata1=fips_hmac fips.so fips_bad_hmac.so
          cp fips_bad_hmac.so /usr/lib64/ossl-modules/fips.so
          # Trigger.
          openssl dgst -sha256 <<<'some text'
          # Restore.
          cp fips.so /usr/lib64/ossl-modules/fips.so
      - name: second test
        test: |
          openssl dgst -sha256 <<<'some text'
execute:
  - how: tmt
It breaks the OpenSSL library, which causes the SSH connection to drop and the run to be aborted. In theory, ssh is able to reconnect even while the OpenSSL library is still broken. But obviously tmt won't do that because, if I understand it correctly, it wouldn't be able to resume the test anyway. It just does not work that way. So (1) is basically not possible unless the test can resume itself (then it would work thanks to #2696).
So all in all, it seems that the aforementioned plan is simply tmt-incompatible.
I would have to modify the 'crasher' test to detect the reboot and attempt to restore the library (and then use the options added in #2696) to get to 'second test'.
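For illustration, a rough sketch of what such a self-resuming test could look like. Everything here is an assumption rather than tmt behavior: `disruptive_attempt` is a hypothetical helper, the paths are parameters so the logic can be tried on scratch files, and it assumes the backup location survives between attempts. A leftover backup is used as the marker that an earlier attempt was cut off before its restore step.

```shell
#!/bin/bash
# Hypothetical helper: one attempt of a disruptive test.
#   $1 = file the test breaks (e.g. the fips.so path from the plan)
#   $2 = where the backup is kept between attempts (assumed to persist)
disruptive_attempt() {
    local target=$1 backup=$2
    if [ -e "$backup" ]; then
        # A backup left behind means a previous attempt was interrupted:
        # undo the damage and finish cleanly.
        cp "$backup" "$target"
        rm -f "$backup"
        echo "restored"
        return 0
    fi
    # First attempt: back up, then break the target.
    cp "$target" "$backup"
    echo "broken" > "$target"   # stand-in for installing the bad library
    echo "disrupted"
    # In the real test the trigger would run here and the session would
    # drop; returning non-zero stands in for that interrupted attempt.
    return 1
}
```

On a scratch file, the first call reports `disrupted` and fails, and the second call reports `restored` and succeeds, after which the next test in the plan would see the original file again.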