teemtee / tmt

Test Management Tool
MIT License

test failed with error after triggering a kernel panic #3284

Open coiby opened 6 days ago

coiby commented 6 days ago

After a kernel panic is triggered, the system reboots successfully, but the test fails with an error. I noticed that a workaround is to run the command that triggers the panic through tmt-reboot.

Note that if the test is written with beakerlib, a similar error occurs, and a line like the following is also printed:

00:00:28 errr /client-test/tests/client (on client) (beakerlib: State 'imcomplete') [1/1]

Here are the logs and the reproducer.

Logs

# tmt run -vvv
/var/tmp/tmt/run-020
Found 1 plan.

/example
summary: Basic kernel panic test
    discover
        how: shell
        order: 50
        summary: 1 test selected
            /script-00
    provision
        queued provision.provision task #1: default-0

        provision.provision task #1: default-0
        how: virtual
        order: 50
        memory: 2048 MB
        disk: 40 GB
        qcow: Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2
        effective hardware: {}
        name: tmt-020-PuwHCrwG
        key: /var/tmp/tmt/run-020/example/provision/default-0/id_ecdsa
        progress: booting...
        primary address: 127.0.0.1
        topology address: 127.0.0.1
        port: 10022
        multihost name: default-0
        arch: x86_64
        distro: Fedora Linux 40 (Cloud Edition)
        kernel: 6.8.5-301.fc40.x86_64
        package manager: dnf
        selinux: yes
        is superuser: yes

        summary: 1 guest provisioned
    prepare
        queued push task #1: push to default-0

        push task #1: push to default-0

        queued prepare task #1: requires on default-0

        prepare task #1: requires on default-0
        how: install
        summary: Install required packages
        name: requires
        order: 70
        where: default-0
        package: 1 package requested
            /usr/bin/flock
            cmd: rpm -q --whatprovides /usr/bin/flock || dnf install -y  /usr/bin/flock
            out: util-linux-core-2.40-0.9.rc1.fc40.x86_64

        queued pull task #1: pull from default-0

        pull task #1: pull from default-0

        summary: 1 preparation applied
    execute
        queued execute task #1: default-0 on default-0

        execute task #1: default-0 on default-0
        how: tmt
        order: 50
        exit-first: false
            test: /script-00
                cmd:
                    echo 2 > /proc/sys/kernel/panic
                    sync
                    if [ "$TMT_REBOOT_COUNT" == 0 ]; then
                       # tmt-reboot -c "echo c > /proc/sysrq-trigger"
                       echo c > /proc/sysrq-trigger
                    fi
                    echo "Test passed"
                out: Shared connection to 127.0.0.1 closed.
                00:00:15 errr /script-00 (on default-0) [1/1]

        summary: 1 test executed
    report
        how: display
        order: 50
            errr /script-00
                output.txt: /var/tmp/tmt/run-020/example/execute/data/guest/default-0/script-00-1/output.txt
                content: Shared connection to 127.0.0.1 closed.

        summary: 1 error
    finish

        guest: stopped
        guest: removed
    Prune '/example' plan workdir '/var/tmp/tmt/run-020/example'.
        summary: 0 tasks completed

total: 1 error

Reproducer


mkdir .fmf
echo -n 1 > .fmf/version

cat << 'EOF' > example.fmf
summary: Basic kernel panic test

provision:
    how: virtual

execute:
    how: tmt
    script: |
      echo 2 > /proc/sys/kernel/panic
      sync
      if [ "$TMT_REBOOT_COUNT" == 0 ]; then
         # tmt-reboot -c "echo c > /proc/sysrq-trigger"
         echo c > /proc/sysrq-trigger
      fi
EOF

if tmt run; then
      echo "Test passed"
else
      echo "Test failed"
fi

happz commented 6 days ago

> After triggering a kernel panic, the system can be rebooted but the test just failed with error.

AFAICT, that is the expected outcome: the test (and the guest) did not reboot, they crashed. The underlying SSH session was abruptly terminated, and all tmt got out of it was exit code 255. From tmt's point of view, this is a mere crash, and it does not know what else to do but report an error and move on.
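
To make the exit code 255 point concrete, here is a minimal sketch of how a caller might interpret the status of an ssh invocation. OpenSSH exits with 255 when the connection itself fails or is lost, and otherwise passes through the remote command's status. The function name `classify_ssh_exit` and the outcome labels are hypothetical and are not tmt's code; note also that a remote command can itself exit with 255, so the mapping is a heuristic.

```shell
#!/usr/bin/env bash
# Hypothetical helper, for illustration only (not part of tmt).
# Maps the exit status of an ssh invocation to a coarse outcome.
classify_ssh_exit() {
    local code="$1"
    case "$code" in
        0)   echo "pass" ;;             # remote command succeeded
        255) echo "connection-lost" ;;  # ssh error, e.g. the guest crashed
        *)   echo "fail" ;;             # remote command's own non-zero status
    esac
}
```

A crash like the one in the logs above would land in the `connection-lost` branch, which is the only information the caller has unless the test tells it more.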

> I notice a workaround is to execute the kernel panic trigger command by tmt-reboot.

Yes, because then the process runs under tmt's control and tmt knows that a reboot is expected: tmt receives your "reboot command", echo c > /proc/sysrq-trigger, terminates the current SSH session running the test, and connects to the guest again to run the command you provided. It expects the command to result in a reboot, so it waits for the guest to recover and then restarts the test.
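
The tmt-reboot flow described here can be sketched as a test body that branches on TMT_REBOOT_COUNT. This is only an illustration: `run_panic_test` and the `REBOOT_CMD` override are hypothetical names added so the flow can be dry-run without tmt, and the `echo 2 > /proc/sys/kernel/panic` and `sync` setup lines from the reproducer are left out for the same reason (a real test would keep them before the trigger).

```shell
#!/usr/bin/env bash
# Sketch of a test body that lets tmt drive the crash.
# REBOOT_CMD is a hypothetical override for dry runs; inside a real
# tmt test it defaults to the tmt-reboot command used in this thread.
run_panic_test() {
    local reboot_cmd="${REBOOT_CMD:-tmt-reboot}"
    if [ "${TMT_REBOOT_COUNT:-0}" -eq 0 ]; then
        # First pass: hand the panic trigger to tmt, which runs it in a
        # new session and then waits for the guest to come back.
        "$reboot_cmd" -c "echo c > /proc/sysrq-trigger"
        return
    fi
    # Later pass: tmt restarted the test after the reboot it managed.
    echo "Test passed"
}
```

Because tmt performs the reboot itself in this variant, TMT_REBOOT_COUNT is the right counter to check.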

Frankly, you decided to kill the guest without telling tmt about it, so the error outcome is perfectly valid :) I'm not sure we can ever resolve this in some automagical way, with tmt realizing something like "aha, this is a kernel panic, the guest is rebooting, I shall restart the test!". All the ideas we have eventually boil down to letting tmt know about it so it can cooperate with your test. See e.g. https://tmt.readthedocs.io/en/stable/spec/tests.html#restart, which I'd say fits your use case:

restart-on-exit-code:
  # this is the exit code tmt receives when SSH session - and the guest - die
  # suddenly due to a crash
  - 255

# I'd set this to `false`, your test already issues the reboot
restart-with-reboot: false

This should tell tmt to wait for the reboot to finish, then reconnect and restart the test.

coiby commented 5 days ago

Thanks for the clarification! Because a Beaker job can automatically resume the test after a kernel panic, I expected tmt to support kernel panics as well.

I just tested restart-on-exit-code, which still led to an error. Am I still missing something?

mkdir .fmf
echo -n 1 > .fmf/version

cat << 'EOF' > main.fmf

/tests:
    /basic:
        restart-on-exit-code:
           - 255
        test: |
              echo 2 > /proc/sys/kernel/panic
              sync
              if [ "$TMT_REBOOT_COUNT" == 0 ]; then
                 # tmt-reboot -c "echo c > /proc/sysrq-trigger"
                 echo c > /proc/sysrq-trigger
              fi
              echo "Test passed"
EOF

if tmt run -a provision -h virtual; then
      echo "Test passed"
else
      echo "Test failed"
fi

The following logs, captured with tmt -vvvddd, may be relevant:

flock "$TMT_TEST_PIDFILE_LOCK" -c "rm -f ${TMT_TEST_PIDFILE}" || exit 123;

exit $_exit_code;'
                cmd:
                    echo 2 > /proc/sys/kernel/panic
                    sync
                    if [ "$TMT_REBOOT_COUNT" == 0 ]; then
                       # tmt-reboot -c "echo c > /proc/sysrq-trigger"
                       echo c > /proc/sysrq-trigger
                    fi
                    echo "Test passed"
                out: Shared connection to 127.0.0.1 closed.
        Command returned '255' (unrecognized).
        Append to file '/var/tmp/tmt/run-011/tests/basic/execute/data/guest/default-0/tests/basic-1/output.txt'.
        Extract results of '/tests/basic'.
            Run command: git rev-parse --is-inside-work-tree
            err: fatal: not a git repository (or any parent up to mount point /)
            err: Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
            Command returned '128' (unrecognized).
                00:00:15 /tests/basic [1/1]
...

happz commented 5 days ago

They are relevant, but they are just a snippet of the full picture. To work correctly, your test needs to check TMT_TEST_RESTART_COUNT instead of TMT_REBOOT_COUNT. Note that the reboot is outside of tmt's control: it is not managed by tmt, it is not even detected by tmt, so TMT_REBOOT_COUNT remains unchanged. TMT_TEST_RESTART_COUNT, however, is incremented because tmt does restart the test.
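
The difference between the two counters can be sketched as a small phase selector. The function name `current_phase` and the phase labels are hypothetical; the only behavior taken from this thread is that TMT_TEST_RESTART_COUNT grows when tmt restarts the test, while TMT_REBOOT_COUNT stays at 0 for a reboot tmt did not perform.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide what the test should do on this invocation.
# It deliberately ignores TMT_REBOOT_COUNT, which does not change when
# the guest reboots on its own after a panic.
current_phase() {
    if [ "${TMT_TEST_RESTART_COUNT:-0}" -eq 0 ]; then
        echo "trigger-panic"       # first invocation: crash the guest
    else
        echo "verify-after-panic"  # tmt restarted the test after the crash
    fi
}
```

A test guarded by TMT_REBOOT_COUNT would select the first phase on every restart and panic the guest again, which is why the counter matters.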

Plus there is indeed one minor issue that may lead to errors, see https://github.com/teemtee/tmt/pull/3291.

With these two changes, I get the expected picture:

/var/tmp/tmt/run-123
Found 1 plan.

/plans
summary: Basic kernel panic test
    discover
        how: fmf
        directory: /tmp/foo
        summary: 1 test selected
            /tests/basic
    provision
        queued provision.provision task #1: default-0

        provision.provision task #1: default-0
        how: virtual
        memory: 2048 MB
        disk: 40 GB
        qcow: Fedora-Cloud-Base-Generic.x86_64-40-1.14.qcow2
        effective hardware: {}
        name: tmt-123-LVjyAajf
        key: /var/tmp/tmt/run-123/plans/provision/default-0/id_ecdsa
        progress: booting...
        primary address: 127.0.0.1
        topology address: 127.0.0.1
        port: 10056
        multihost name: default-0
        arch: x86_64
        distro: Fedora Linux 40 (Cloud Edition)
        kernel: 6.8.5-301.fc40.x86_64
        package manager: dnf
        selinux: yes
        is superuser: yes

        summary: 1 guest provisioned
    prepare
        queued push task #1: push to default-0

        push task #1: push to default-0

        queued prepare task #1: requires on default-0

        prepare task #1: requires on default-0
        how: install
        summary: Install required packages
        name: requires
        where: default-0
        package: 1 package requested
            /usr/bin/flock
            cmd: rpm -q --whatprovides /usr/bin/flock || dnf install -y  /usr/bin/flock

        queued pull task #1: pull from default-0

        pull task #1: pull from default-0

        summary: 1 preparation applied
    execute
        queued execute task #1: default-0 on default-0

        execute task #1: default-0 on default-0
        how: tmt
        exit-first: false
            test: /tests/basic
                cmd:
                    echo 2 > /proc/sys/kernel/panic
                    sync
                    if [ "$TMT_TEST_RESTART_COUNT" == 0 ]; then
                       # tmt-reboot -c "echo c > /proc/sysrq-trigger"
                       echo c > /proc/sysrq-trigger
                    fi
                    echo "Test passed"
                00:00:09 /tests/basic [1/1]
            test: /tests/basic
                cmd:
                    echo 2 > /proc/sys/kernel/panic
                    sync
                    if [ "$TMT_TEST_RESTART_COUNT" == 0 ]; then
                       # tmt-reboot -c "echo c > /proc/sysrq-trigger"
                       echo c > /proc/sysrq-trigger
                    fi
                    echo "Test passed"
                00:00:00 pass /tests/basic (on default-0) [1/1]

        summary: 1 test executed
    report
        how: display
            pass /tests/basic
                output.txt: /var/tmp/tmt/run-123/plans/execute/data/guest/default-0/tests/basic-1/output.txt
        summary: 1 test passed
    finish

        guest: stopped
        guest: removed
        summary: 0 tasks completed

total: 1 test passed
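
Putting the suggestions from this thread together, an adjusted reproducer might look like the sketch below. It assumes a tmt version that includes the fix from https://github.com/teemtee/tmt/pull/3291. The `write_reproducer` wrapper and its directory argument are hypothetical, added only so the files can be generated without running anything on the host. The `restart-with-reboot: false` line follows the earlier suggestion in the thread and was not part of the second reproducer as posted.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the adjusted reproducer into a directory.
# The test content mirrors the thread's second reproducer, with the
# counter switched to TMT_TEST_RESTART_COUNT.
write_reproducer() {
    local dir="$1"
    mkdir -p "$dir/.fmf"
    echo -n 1 > "$dir/.fmf/version"
    cat << 'EOF' > "$dir/main.fmf"
/tests:
    /basic:
        restart-on-exit-code:
           - 255
        restart-with-reboot: false
        test: |
              echo 2 > /proc/sys/kernel/panic
              sync
              if [ "$TMT_TEST_RESTART_COUNT" == 0 ]; then
                 echo c > /proc/sysrq-trigger
              fi
              echo "Test passed"
EOF
}
```

After calling `write_reproducer` with a scratch directory, the run itself would be started from that directory with the same command as before, `tmt run -a provision -h virtual`.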