Open coiby opened 6 days ago
After triggering a kernel panic, the system can be rebooted but the test just failed with error.
AFAICT, that is the expected outcome: test (and guest) did not reboot, they crashed, the underlying SSH session was abruptly terminated and all tmt got out of it was an exit code 255. From tmt's point of view, this is a mere crash, and it does not know what else to do than report an error and move on.
I notice a workaround is to execute the kernel panic trigger command via `tmt-reboot`.
Yes, because then the process runs under tmt's control and tmt is aware that a reboot is expected: tmt gets your "reboot command", `echo c > /proc/sysrq-trigger`, terminates the current SSH session running the test, and connects to the guest once again to run the command you provided. It expects this to result in a reboot, waits for the guest to recover, and then restarts the test.
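A minimal sketch of that workaround as a test script, assuming the test runs under tmt so that `tmt-reboot` and `TMT_REBOOT_COUNT` are available. The `REBOOT_HELPER` override and the function name `run_panic_via_tmt` are hypothetical; they are there only so the logic can be dry-run outside a guest.

```shell
#!/bin/sh
# Sketch only: let tmt perform the panic so that it expects the reboot.
# REBOOT_HELPER is a hypothetical override for dry runs; under tmt it
# falls back to the real tmt-reboot helper.

run_panic_via_tmt() {
    helper="${REBOOT_HELPER:-tmt-reboot}"

    # TMT_REBOOT_COUNT is 0 on the first run and grows with every reboot
    # that tmt itself manages.
    if [ "${TMT_REBOOT_COUNT:-0}" -eq 0 ]; then
        # Hand the panic trigger to tmt instead of running it directly.
        "$helper" -c "echo c > /proc/sysrq-trigger"
        return
    fi

    # Reached only after tmt has brought the guest back and restarted the test.
    echo "Test passed"
}
```

Calling `run_panic_via_tmt` as the last line of the test script gives the same behaviour as the commented-out `tmt-reboot -c` line in the reproducer.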
Frankly, you decided to kill the guest without telling tmt about it, so the error outcome is perfectly valid :) I'm not sure we can ever resolve this in some automagical way, with tmt being able to realize something like "aha, this is a kernel panic, the guest is rebooting, I shall restart the test!". All the ideas we have eventually boil down to letting tmt know about it so it can cooperate with your test. See e.g. https://tmt.readthedocs.io/en/stable/spec/tests.html#restart, which I'd say fits your use case:
restart-on-exit-code:
  # this is the exit code tmt receives when SSH session - and the guest - die
  # suddenly due to a crash
  - 255
# I'd set this to `false`, your test already issues the reboot
restart-with-reboot: false
This should tell tmt that it should wait for the reboot to pass, and reconnect and restart the test.
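To make the matching concrete, here is a rough model of that decision in shell. It is not tmt's implementation: `should_restart` and its arguments are hypothetical. The only facts taken from this thread are that a dying SSH session yields exit code 255 and that `restart-on-exit-code` lists the codes that trigger a restart.

```shell
#!/bin/sh
# Hypothetical model, not tmt code.
# Usage: should_restart EXIT_CODE CONFIGURED_CODE...
# Prints "restart" when the test's exit code is one of the configured
# restart-on-exit-code values, "report" otherwise.

should_restart() {
    exit_code="$1"
    shift

    for code in "$@"; do
        if [ "$exit_code" -eq "$code" ]; then
            echo "restart"
            return 0
        fi
    done

    # No configured code matched: the exit code is reported as a result.
    echo "report"
    return 0
}
```

With `255` configured, `should_restart 255 255` prints `restart`, while an ordinary failure such as `should_restart 1 255` prints `report`.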
Thanks for the clarification! A Beaker job can resume the test automatically after a kernel panic, so I expected tmt to support panics as well.
I just tested `restart-on-exit-code`, which still led to an error. Am I still missing anything?
mkdir .fmf
echo -n 1 > .fmf/version
cat << 'EOF' > main.fmf
/tests:
    /basic:
        restart-on-exit-code:
          - 255
        test: |
            echo 2 > /proc/sys/kernel/panic
            sync
            if [ "$TMT_REBOOT_COUNT" == 0 ]; then
                # tmt-reboot -c "echo c > /proc/sysrq-trigger"
                echo c > /proc/sysrq-trigger
            fi
            echo "Test passed"
EOF
if tmt run -a provision -h virtual; then
    echo "Test passed"
else
    echo "Test failed"
fi
The following logs from `tmt -vvvddd` may be relevant:
flock "$TMT_TEST_PIDFILE_LOCK" -c "rm -f ${TMT_TEST_PIDFILE}" || exit 123;
exit $_exit_code;'
cmd:
echo 2 > /proc/sys/kernel/panic
sync
if [ "$TMT_REBOOT_COUNT" == 0 ]; then
# tmt-reboot -c "echo c > /proc/sysrq-trigger"
echo c > /proc/sysrq-trigger
fi
echo "Test passed"
out: Shared connection to 127.0.0.1 closed.
Command returned '255' (unrecognized).
Append to file '/var/tmp/tmt/run-011/tests/basic/execute/data/guest/default-0/tests/basic-1/output.txt'.
Extract results of '/tests/basic'.
Run command: git rev-parse --is-inside-work-tree
err: fatal: not a git repository (or any parent up to mount point /)
err: Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Command returned '128' (unrecognized).
00:00:15 /tests/basic [1/1]
...
They are relevant, but they are just a snippet of the full picture. To work correctly, your test needs to check `TMT_TEST_RESTART_COUNT` instead of `TMT_REBOOT_COUNT`. Note that the reboot is outside of tmt's control: it's not managed by tmt, it's not even detected by tmt. Therefore `TMT_REBOOT_COUNT` will remain unchanged, but `TMT_TEST_RESTART_COUNT` will be increased, because tmt does restart the test.
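A small sketch of a test gated on the restart counter, which also prints both counters so the difference shows up in `output.txt`. The `PANIC_TRIGGER` override and the function names are hypothetical, added so the branch logic can be checked without actually crashing a machine.

```shell
#!/bin/sh
# Sketch only: panic on the first invocation, pass after tmt restarts the test.
# PANIC_TRIGGER is a hypothetical override for dry runs.

trigger_panic() {
    echo 2 > /proc/sys/kernel/panic   # reboot 2 seconds after a panic
    sync
    echo c > /proc/sysrq-trigger      # crash the kernel
}

run_restart_aware_test() {
    trigger="${PANIC_TRIGGER:-trigger_panic}"

    # After an unmanaged reboot only the restart counter is expected to move.
    echo "reboots=${TMT_REBOOT_COUNT:-0} restarts=${TMT_TEST_RESTART_COUNT:-0}"

    if [ "${TMT_TEST_RESTART_COUNT:-0}" -eq 0 ]; then
        "$trigger"
        return
    fi

    echo "Test passed"
}
```

On the second invocation the first output line should read `reboots=0 restarts=1`, matching the explanation above.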
Plus there is indeed one minor issue that may lead to errors, see https://github.com/teemtee/tmt/pull/3291.
With these two changes, I get the expected picture:
/var/tmp/tmt/run-123
Found 1 plan.
/plans
summary: Basic kernel panic test
discover
how: fmf
directory: /tmp/foo
summary: 1 test selected
/tests/basic
provision
queued provision.provision task #1: default-0
provision.provision task #1: default-0
how: virtual
memory: 2048 MB
disk: 40 GB
qcow: Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2
effective hardware: {}
name: tmt-123-LVjyAajf
key: /var/tmp/tmt/run-123/plans/provision/default-0/id_ecdsa
progress: booting...
primary address: 127.0.0.1
topology address: 127.0.0.1
port: 10056
multihost name: default-0
arch: x86_64
distro: Fedora Linux 40 (Cloud Edition)
kernel: 6.8.5-301.fc40.x86_64
package manager: dnf
selinux: yes
is superuser: yes
summary: 1 guest provisioned
prepare
queued push task #1: push to default-0
push task #1: push to default-0
queued prepare task #1: requires on default-0
prepare task #1: requires on default-0
how: install
summary: Install required packages
name: requires
where: default-0
package: 1 package requested
/usr/bin/flock
cmd: rpm -q --whatprovides /usr/bin/flock || dnf install -y /usr/bin/flock
queued pull task #1: pull from default-0
pull task #1: pull from default-0
summary: 1 preparation applied
execute
queued execute task #1: default-0 on default-0
execute task #1: default-0 on default-0
how: tmt
exit-first: false
test: /tests/basic
cmd:
echo 2 > /proc/sys/kernel/panic
sync
if [ "$TMT_TEST_RESTART_COUNT" == 0 ]; then
# tmt-reboot -c "echo c > /proc/sysrq-trigger"
echo c > /proc/sysrq-trigger
fi
echo "Test passed"
00:00:09 /tests/basic [1/1]
test: /tests/basic
cmd:
echo 2 > /proc/sys/kernel/panic
sync
if [ "$TMT_TEST_RESTART_COUNT" == 0 ]; then
# tmt-reboot -c "echo c > /proc/sysrq-trigger"
echo c > /proc/sysrq-trigger
fi
echo "Test passed"
00:00:00 pass /tests/basic (on default-0) [1/1]
summary: 1 test executed
report
how: display
pass /tests/basic
output.txt: /var/tmp/tmt/run-123/plans/execute/data/guest/default-0/tests/basic-1/output.txt
summary: 1 test passed
finish
guest: stopped
guest: removed
summary: 0 tasks completed
total: 1 test passed
After triggering a kernel panic, the system can be rebooted but the test just failed with error. I notice a workaround is to execute the kernel panic trigger command by `tmt-reboot`.
Not if the test is written with beakerlib: a similar error will occur, and something like
# the errr could also be 00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'imcomplete') [1/1]
will also be printed. Here are the logs and the reproducer.
Logs
Reproducer