Improve robustness of SSH connection to SUT

The-Mule commented 3 weeks ago

TMT uses an ssh session to watch the SUT (the guest machine running the test). When anything happens to this session, the run is aborted completely. It should be acceptable for a test to break the connection temporarily (e.g. a test might affect networking for a short time, or it might mess with libraries that ssh relies on). I would like to propose:

1. Make the ssh session more robust so that it can handle these situations: when the session breaks, attempt to reconnect and resume the test (it is very likely that all disruptive test actions will have been completed and reverted by that point).
2. In the rare situations when the ssh session cannot be resumed even after several attempts, don't abort the run. Attempt to salvage the results preceding the problem and run the remaining tests from scratch. Or perhaps disable the problematic test and create a new run without it.
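
The "retry several times before giving up" part of this proposal could look roughly like the sketch below. It is only an illustration of a bounded retry loop, not how tmt handles its connections; the `retry` helper, the attempt count, the delay and the host name in the usage comment are all hypothetical.

```shell
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts were made.
# Returns 0 on the first success, 1 when every attempt failed.
retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt/$max_attempts failed" >&2
        # Do not sleep after the final attempt.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Hypothetical usage (host name is a placeholder): probe the guest
# up to 5 times, 10 seconds apart, before declaring it lost.
# retry 5 10 ssh -o BatchMode=yes -o ConnectTimeout=5 root@guest.example.com true
```

A loop like this only tells whether the guest is reachable again; it cannot by itself resume a command that was running in the session that broke.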

happz commented 3 weeks ago

TMT uses ssh session to watch the SUT (guest machine running the test). When anything happens to this session, run is aborted completely. It should be acceptable for test to break the connection temporarily (e.g. a test might affect networking for a short time, test might mess with libraries that ssh relies on, etc.). I would like to propose:

Make ssh session more robust to be able to handle these situations, when session breaks - attempt to retry connection and resume test

Please check https://github.com/teemtee/tmt/issues/2696, it seems to be related to your situation. If it does not help, it would be very useful if you could share with us what else should be included to help with your case.

(it is very likely that all disruptive test actions will be completed and reverted at that point).

Yeah, in your case, when you and your test are causing the changes on purpose. But telling the difference between "expected" and "the lab in the US is burning down" is the hard part, so I would very much dispute the "likely" bit :)

In rare situations when ssh session cannot be resumed even after several attempts don't abort the run. Attempt to salvage the run results preceding the problem and run the remaining ones from scratch again. Or perhaps disable the problematic test and create a new run without it.

Re-running "from scratch" would be a possible solution, with or without dropping the test: restarting the plan, including provisioning a brand new guest to avoid running tests in a tainted environment. That is well beyond what tmt can do now, though.

The-Mule commented 1 week ago

IMO #2696 solves (2) for me.

Ad (1). I am starting to realize that this might not be possible. What I am looking for is to be able to "resurrect" a closed SSH session. This is not possible by design of SSH unless some other tool handles it underneath the ssh connection, so that once the SSH connection is closed and SSH reconnects, it can continue where it left off (e.g. something like screen or tmux). To be more specific, I have the following plan:

❯ cat crasher.fmf
prepare:
    - name: Enable FIPS mode
      how: ansible
      order: 99
      playbook:
        - /playbooks/enable-fips.yaml

discover:
    - how: shell
      tests:
        - name: crasher
          test: |
            set -x
            # Backup.
            cp /usr/lib64/ossl-modules/fips.so .

            # Configure (break openssl).
            dd if=/dev/zero of=fips_hmac bs=8 count=1 conv=notrunc
            objcopy --update-section .rodata1=fips_hmac fips.so fips_bad_hmac.so
            cp fips_bad_hmac.so /usr/lib64/ossl-modules/fips.so

            # Trigger.
            openssl dgst -sha256 <<<'some text'

            # Restore.
            cp fips.so /usr/lib64/ossl-modules/fips.so
        - name: second test
          test: |
            openssl dgst -sha256 <<<'some text'

execute:
    - how: tmt

It breaks the openssl library, which causes the ssh connection to drop and the run to be aborted. In theory, ssh is able to reconnect even while the openssl library is still broken. But obviously tmt won't do that because, if I understand it correctly, it wouldn't be able to just resume the test anyway. It just does not work that way. So (1) is basically not possible unless the test can resume itself (then it would work thanks to #2696).

So all in all, it seems that the aforementioned plan is simply tmt-incompatible.
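
For illustration, the "some other tool underneath the ssh connection" idea could be sketched as follows. This swaps in plain `nohup` plus a status file for screen or tmux, so there is no terminal to reattach to; it only shows how a command can outlive the session that started it and how a later session can pick up its result. The helper names and the status file location are hypothetical, and none of this is something tmt does.

```shell
# run_detached STATUS_FILE COMMAND [ARGS...]
# Starts COMMAND in the background, ignoring hangups, and records its
# exit code in STATUS_FILE once it finishes.
run_detached() {
    status_file=$1
    shift
    rm -f "$status_file"
    # Inside the inner shell, $0 is STATUS_FILE and "$@" is the command.
    nohup sh -c '"$@"; echo "$?" > "$0"' "$status_file" "$@" >/dev/null 2>&1 &
}

# wait_for_status STATUS_FILE TIMEOUT_SECONDS
# Polls once per second. Prints the recorded exit code and returns 0,
# or returns 1 if nothing was recorded before the timeout.
wait_for_status() {
    status_file=$1
    remaining=$2
    while :; do
        if [ -s "$status_file" ]; then
            cat "$status_file"
            return 0
        fi
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Hypothetical usage: the first session starts the test, a later
# (reconnected) session collects the result.
# run_detached /var/tmp/crasher.status sh ./crasher.sh
# wait_for_status /var/tmp/crasher.status 600
```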

The-Mule commented 1 week ago

I would have to modify the 'crasher' test to detect a reboot and to attempt to restore the library (and then use the options added in #2696) to get to 'second test'.
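
A minimal sketch of that kind of self-recovery, assuming the test keeps its backup somewhere that survives the restart: if the backup is still present when the test starts, an earlier attempt was cut off before its restore step, so the library is put back first. The `restore_if_interrupted` helper and the backup path in the usage comment are hypothetical, and this detects an interrupted attempt by the leftover backup rather than by asking tmt.

```shell
# restore_if_interrupted BACKUP TARGET
# If BACKUP exists, a previous attempt never reached its restore step:
# copy BACKUP over TARGET, remove BACKUP, and return 0.
# Returns 1 when there is nothing to restore.
restore_if_interrupted() {
    backup=$1
    target=$2
    [ -f "$backup" ] || return 1
    cp "$backup" "$target" && rm -f "$backup"
}

# Hypothetical use at the top of the 'crasher' test (backup path is an
# assumption, chosen to outlive a restart):
# if restore_if_interrupted /var/tmp/fips.so.orig /usr/lib64/ossl-modules/fips.so; then
#     echo "library restored after an interrupted attempt"
#     exit 0
# fi
```

For this to work, the normal path of the test would also have to remove the backup after its own restore step, so that a leftover backup really does mean "interrupted".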

teemtee / tmt

Improve robustness of SSH connection to SUT #2890