teemtee / tmt

Test Management Tool
MIT License

Support restart of test when it crashes #2696

Open happz opened 4 months ago

happz commented 4 months ago

As discussed today, there's a use case for restarting a test when it crashes:

09:26:12                 out: :: [ 14:26:12 ] :: [   PASS   ] :: Command 'make all' (Expected 0, got 0)
09:26:12                 out: :: [ 14:26:12 ] :: [  BEGIN   ] :: Running 'echo 1 > /sys/kernel/vkm/write_um_crash'
09:26:12                 out: ./tmt-test-wrapper.sh.default-0: line 1:  6543 Segmentation fault      bash ./write_um.sh
09:26:12                 out: Shared connection to 10.26.28.203 closed.
09:26:12         Command returned '139'.

In this case, the user would like to see the test restarted - the test was killed by a kernel oops, and when restarted, it would take care of follow-up steps, like decoding the kernel dump.
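
For readers less familiar with shell exit statuses: the '139' in the log above is consistent with the "Segmentation fault" message, because shells conventionally report a process killed by signal N as 128 + N, and SIGSEGV is signal 11 on Linux. A minimal sketch of that decoding follows. The function name describe_status is made up for illustration and is not part of tmt or beakerlib.

```shell
# Decode a shell exit status.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
# describe_status is a hypothetical helper, not a tmt or beakerlib function.
describe_status() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
    else
        echo "exited with code $code"
    fi
}

describe_status 139   # prints: killed by signal 11
```

This only interprets the number; it says nothing about whether the machine itself survived the crash.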

After some discussion, the proposal would be:

sbertramrh commented 4 months ago

Hi @happz and @lukaszachy, I found a workaround for my case. With nohup, the crash no longer aborts the test, and the test continues past the error.

        # Read only crash test
        rlRun "nohup echo 1 > /sys/kernel/vkm/write_ro_crash" "0-255"
        while (! ping -q -c 1 ${SOC///*}); do
            sleep 5
        done
        rlRun "dmesg > dmesg-crash.log"
        rlAssertGrep "Unable to handle kernel write to read-only memory" dmesg-crash.log

result:

15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash'
15:00:13                 out: /usr/share/beakerlib/testing.sh: line 896:  1467 Segmentation fault      nohup echo 1 > /sys/kernel/vkm/write_ro_crash
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'nohup echo 1 > /sys/kernel/vkm/write_ro_crash' (Expected 0-255, got 139)
15:00:13                 out: PING 10.26.28.203 (10.26.28.203) 56(84) bytes of data.
15:00:13                 out: 
15:00:13                 out: --- 10.26.28.203 ping statistics ---
15:00:13                 out: 1 packets transmitted, 1 received, 0% packet loss, time 0ms
15:00:13                 out: rtt min/avg/max/mdev = 0.046/0.046/0.046/0.000 ms
15:00:13                 out: :: [ 20:00:13 ] :: [  BEGIN   ] :: Running 'dmesg > dmesg-crash.log'
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: Command 'dmesg > dmesg-crash.log' (Expected 0, got 0)
15:00:13                 out: :: [ 20:00:13 ] :: [   PASS   ] :: File 'dmesg-crash.log' should contain 'Unable to handle kernel write to read-only memory' 
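
One caveat with the workaround above is that its ping loop never gives up if the host does not come back. A bounded variant might look like the sketch below. The helper name wait_until and its arguments are assumptions made for illustration; they do not come from tmt or beakerlib.

```shell
# Retry a probe command until it succeeds or the attempts run out.
# Usage: wait_until ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# wait_until is a hypothetical helper, not a tmt or beakerlib function.
wait_until() {
    attempts=$1
    interval=$2
    shift 2
    while [ "$attempts" -gt 0 ]; do
        # Run the probe quietly; success ends the wait immediately.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Sleep only if another attempt will follow.
        if [ "$attempts" -gt 0 ]; then
            sleep "$interval"
        fi
    done
    return 1
}

# Example (not executed here): probe a host up to 60 times, 5 seconds apart.
# wait_until 60 5 ping -q -c 1 "$host"
```

In a beakerlib test the call could be wrapped in rlRun so that a host that never returns is reported as a failure instead of hanging the test.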

pablmart commented 4 months ago

Hello, @happz and @lukaszachy

I wrote a test that forces a stack underflow within a kernel module. This causes a BUG and, with kernel.panic set to 5 seconds via sysctl, a subsequent restart.

[ 1748.996748] BUG: unable to handle page fault for address: ffffaa90401e8000
[ 1748.996751] #PF: supervisor read access in kernel mode
[ 1748.996752] #PF: error_code(0x0000) - not-present page
[ 1748.996753] PGD 1800067 P4D 1800067 PUD 1a0e067 PMD 1a18067 PTE 0
[ 1748.996759] Oops: 0000 [#1] PREEMPT_RT SMP NOPTI
[ 1748.996762] CPU: 3 PID: 50 Comm: ksoftirqd/3 Tainted: G OE X ------- --- 5.14.0-427.380.el9iv.x86_64 #1
[ 1748.996765] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS edk2-20230524-3.fc38 05/24/2023
[ 1748.996766] RIP: 0010:tasklet_fn+0x66/0x78 [stackman]
[ 1748.996770] Code: 75 02 eb fe 58 ff c8 75 fb eb 1a 48 c7 44 24 10 79 56 34 12 e8 a7 fe ff ff 48 c7 c7 f8 10 86 c0 e8 8e fe 75 f2 b8 00 00 01 00 <58> ff c8 75 fb 48 c7 c7 b6 10 86 c0 5b e9 77 fe 75 f2 90 90 90 90

I tried calling 'rstrnt-prepare-reboot' before loading the module that causes the crash, but tmt disconnects, tries to rsync, and times out.

I think this test and two others that exercise memory violation handling within the kernel are good arguments for implementing this feature.

happz commented 3 months ago

@pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

pablmart commented 3 months ago

> @pablmart hello, could you share the test? I'd like to use it as a reproducer when working on the feature.

Yes, the test is in the same repo linked in the comment above that mentions 'rstrnt-prepare-reboot':

kernel-stack-overflow-udnerflow-scribbling

weiwang-linda commented 3 months ago

I encountered a similar problem when testing the ftrace= kernel parameter with tmt run.

I tested manually with auto-osbuild-qemu-rhivos9-qa-ostree-aarch64-7874633.e1769674.qcow2.xz.

The available tracers are:

$ cat /sys/kernel/debug/tracing/available_tracers
timerlat osnoise hwlat blk function_graph wakeup_dl wakeup_rt wakeup function nop

  1. Install a VM with the above image.
  2. export CMDLINEARGS="ftrace=timerlat"
  3. rpm-ostree kargs --append-if-missing="${CMDLINEARGS##-}" --import-proc-cmdline
  4. systemctl reboot

After the reboot, the host can no longer be reached over ssh. Only "timerlat" and "osnoise" make the host panic.

happz commented 3 months ago

Kicking off the implementation of the actual test restart in https://github.com/teemtee/tmt/pull/2870. It does have some rough edges, although there is a test that passes.

I plan to run it with the kernel-stack-overflow-udnerflow-scribbling test provided by @pablmart, feel free to experiment too.

One piece we need to address ASAP - naming. I picked some names for new keys, but they are ugly and I don't like them. I can change them easily, but I'm out of ideas - feel free to propose changes here as well, besides the actual bugs and issues :)

happz commented 2 months ago

A similar case: what if the test does not crash, but triggers a reboot, e.g. through an Ansible role that cannot use tmt-reboot? This would manifest as a broken SSH session:

                out: TASK [sap_general_preconfigure : Flush handlers] *******************************
                out: 
                out: RUNNING HANDLER [sap_general_preconfigure : Reboot the managed node] ***********
                out: Shared connection to restqe01 closed.
            cmd: rsync --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: dnf --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: rpm-ostree --version
            err: ssh: connect to host restqe01 port 22: Connection refused
            cmd: yum install -y rsync
            err: ssh: connect to host restqe01 port 22: Connection refused

pablmart commented 2 months ago

PR #2870 solves the issue with the kernel-stack-overflow-underflow-scribbling test. Many thanks!

coiby commented 18 hours ago

Hi @pablmart,

This is Coiby from the kernel debug sst. I'm considering adopting tmt for the https://github.com/rhkdump/kdump-utils tests. In our tests, we need to trigger a kernel crash intentionally and then check whether the crash dump can be collected. I want to study your kernel-stack-overflow-underflow-scribbling test to learn how to make use of #2870, but it's gone now. Could you re-share it with me? Thanks!

coiby commented 14 hours ago

Hi @happz,

I wrote a multihost test that dumps a kernel crash to a remote NFS server, but unfortunately it failed with an error:

                        kdump: Starting kdump: [OK]
                        :: [ 10:08:20 ] :: [   PASS   ] :: Command 'kdumpctl restart' (Expected 0, got 0)
                        :: [ 10:08:20 ] :: [  BEGIN   ] :: Running 'echo 1 > /proc/sys/kernel/sysrq'
                        :: [ 10:08:20 ] :: [   PASS   ] :: Command 'echo 1 > /proc/sys/kernel/sysrq' (Expected 0, got 0)
                        client_loop: send disconnect: Broken pipe
                    journal.txt: /var/tmp/tmt/run-035/plans/kdump/execute/data/guest/client/client-test/tests/client-4/journal.txt

With restart-on-exit-code provided by #2870, I expected the test to be restarted after a kernel panic:

diff --git a/tests/client/main.fmf b/tests/client/main.fmf
index d74446f..261b8bb 100644
--- a/tests/client/main.fmf
+++ b/tests/client/main.fmf
@@ -1,3 +1,5 @@
 summary: Dump kernel crash to an NFS server
 test: ./test.sh
 framework: beakerlib
+restart-on-exit-code: 79
+restart-max-count: 5

Unfortunately, it doesn't work. Am I missing something? Could you give me a hint? Thanks!
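
To make the expectation behind the two keys in the diff concrete, here is a rough sketch of the semantics their names suggest: rerun the test while it exits with the configured code, up to a maximum number of restarts. This is an illustration only and is not tmt's implementation. The function name run_with_restart is hypothetical, and how tmt actually evaluates restart-on-exit-code and restart-max-count should be checked against #2870.

```shell
# Illustration only: NOT tmt's implementation.
# Usage: run_with_restart RESTART_CODE MAX_RESTARTS COMMAND [ARGS...]
# Reruns COMMAND while it exits with RESTART_CODE, at most MAX_RESTARTS times.
run_with_restart() {
    restart_code=$1
    max_restarts=$2
    shift 2
    restarts=0
    while :; do
        # Capture the exit status without tripping "set -e".
        if "$@"; then
            status=0
        else
            status=$?
        fi
        # Any other exit code ends the loop and is passed through.
        if [ "$status" -ne "$restart_code" ]; then
            return "$status"
        fi
        # The restart budget is used up: report the last status.
        if [ "$restarts" -ge "$max_restarts" ]; then
            return "$status"
        fi
        restarts=$((restarts + 1))
        echo "restart $restarts/$max_restarts after exit code $status" >&2
    done
}
```

Under this reading, a restart only happens when the observed exit code equals the configured one. Note that the ssh client itself exits with status 255 when the connection fails, so it may be worth checking which exit code was actually recorded for the run that ended in "Broken pipe".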