Make login test case more reliable

mgalka commented 3 years ago

auto-login-action gives a "PASS" result even when a kernel panic occurs, which leads to false positives. These changes check whether the callback mentions any test suites other than "lava", and only then treat "auto-login-action" as a reliable source for the login test case result. If no test suites other than "lava" are present, the backend uses the "job" test case to determine whether the login test case was successful.
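
The selection rule described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the function name `login_status` and the shape of the `suites` argument (a dict mapping suite name to a list of test case dicts with `name` and `result` keys) are assumptions made for the example, as is returning `None` when the chosen test case is missing.

```python
def login_status(suites):
    """Pick the test case that decides the login result.

    `suites` is a hypothetical, simplified view of the LAVA callback:
    {suite_name: [{"name": ..., "result": ...}, ...]}.
    """
    lava_cases = {tc["name"]: tc for tc in suites.get("lava", [])}
    other_suites = [name for name in suites if name != "lava"]

    if other_suites:
        # Other suites ran, so the system got far enough for
        # auto-login-action to be trusted.
        source = lava_cases.get("auto-login-action")
    else:
        # Only "lava" is present: fall back to the overall "job" test case.
        source = lava_cases.get("job")

    if source is None:
        # Assumption: no verdict when the deciding test case is absent.
        return None
    return "pass" if source.get("result") == "pass" else "fail"
```

With only the "lava" suite present, a passing auto-login-action is ignored and the "job" result decides the outcome.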

mgalka commented 3 years ago

I ran some tests on my local setup:

Build a kernel that fails with "Unhandled fault" during boot.

Run the baseline_qemu test plan on LAVA. We can see that the kernel boot fails with "Unhandled fault":

<6>[    2.206134] ThumbEE CPU extension supported.
<5>[    2.207467] Registering SWP/SWPB emulation handler
<5>[    2.209165] Loading compiled-in X.509 certificates
<6>[    2.226189] input: gpio-keys as /devices/platform/gpio-keys/input/input0
<1>[    2.235006] 8<--- cut here ---
<1>[    2.235200] Unhandled fault: page domain fault (0x01b) at 0x00000000
...
<0>[    2.263909] Exception stack(0xee8c3fb0 to 0xee8c3ff8)
<0>[    2.264113] 3fa0:                                     00000000 00000000 00000000 00000000
<0>[    2.264438] 3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[    2.264824] 3fe0: 00000000 00000000 00000000 00000000 00000013 00000000
<0>[    2.265264] Code: ebfd7308 e3700a01 e1a05000 8a000073 (e5953000) 
<4>[    2.266006] ---[ end trace f31bc6e61e4dc98f ]---
<0>[    2.266283] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
<0>[    2.266805] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Despite the kernel boot failure, login-action reports a false positive:

case: login-action
case_id: 15972
definition: lava
duration: 2.05
extra: ...
level: 2.2.1
namespace: common
result: pass

With the changes in this PR, KernelCI handles this situation and sets the login test case status to "FAIL":

{
    "_id" : ObjectId("5f92ec0780cb8bb3beef2a2c"),
    "test_case_path" : "baseline.login",
    "test_group_id" : ObjectId("5f92ec0780cb8bb3beef2a2b"),
    "created_on" : ISODate("2020-10-23T15:09:58.630Z"),
    "mach" : "qemu",
    "git_branch" : "unhandled_exception",
    "log_lines" : [],
    "arch" : "arm",
    "defconfig_full" : "multi_v7_defconfig",
    "measurements" : [],
    "job" : "mgalka",
    "lab_name" : "mgalka-lava-local",
    "index" : 1,
    "kernel" : "5680d14d59bd",
    "git_commit" : "5680d14d59bddc8bcbc5badf00dbbd4374858497",
    "device_type" : "qemu_arm-virt-gicv3",
    "version" : "1.0",
    "time" : ISODate("1970-01-01T00:00:00.000Z"),
    "name" : "login",
    "build_environment" : "gcc-8",
    "status" : "FAIL",
    "regression_id" : null,
    "plan" : "baseline"
}

gctucker commented 3 years ago

Looks like this is causing some exceptions:

Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]: [   ERROR/MainThread] Task lava-test[d9f0b75e-bc63-4862-ae62-02f3d1c0a30a] raised unexpected: AttributeError("'NoneType' object has no attribute 'get'",)
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]: Traceback (most recent call last):
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:   File "/srv/.venv/api.staging.kernelci.org/local/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:     R = retval = fun(*args, **kwargs)
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:   File "/srv/.venv/api.staging.kernelci.org/local/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:     return self.run(*args, **kwargs)
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:   File "/srv/api.staging.kernelci.org/app/taskqueue/tasks/callback.py", line 42, in lava_test
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:     taskc.app.conf.db_options)
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:   File "/srv/api.staging.kernelci.org/app/utils/callback/lava.py", line 684, in add_tests
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]:     if login_tc.get('result') == 'pass' and len(groups) == 0:
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]: AttributeError: 'NoneType' object has no attribute 'get'
Nov 02 06:58:23 kernelci-staging kernelci-celery[28049]: [   ERROR/MainThread] Task failed, UUID: d9f0b75e-bc63-4862-ae62-02f3d1c0a30a, error: 'NoneType' object has no attribute 'get'
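
The traceback shows `login_tc` being `None` when `.get('result')` is called on it. One possible guard is sketched below. This is not necessarily the fix that was applied in this PR: the helper name `login_passed_without_groups` is hypothetical, and treating a missing login test case as "not passed" is an assumption.

```python
def login_passed_without_groups(login_tc, groups):
    """Hypothetical None-safe version of the failing condition."""
    if login_tc is None:
        # Assumption: no login test case in the callback means no pass.
        return False
    return login_tc.get('result') == 'pass' and len(groups) == 0
```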

gctucker commented 3 years ago

Adding staging-skip tag as this is breaking some test results.

@mgalka Please fix this issue.

mgalka commented 3 years ago

Staging tests completed.

An Unhandled Fault can be observed during the kernel boot: https://lava.collabora.co.uk/scheduler/job/2801576#L415 , but the auto-login-action status is set to pass: https://lava.collabora.co.uk/scheduler/job/2801576#results_113505255 . The kernel didn't boot and no other test cases were run.

After kernelci-backend processes the request, the baseline.login test case status is set to fail: https://staging.kernelci.org/test/case/id/5fa95e2e6f6a4359c4c93394/

gctucker commented 3 years ago

I see, great.

Also here's the view with all the baseline results for that test run: https://staging.kernelci.org/test/plan/id/5fa95e2e6f6a4359c4c93393/

kernelci / kernelci-backend

Make login test case more reliable #262