metan-ucw / runltp-ng

Minimalistic LTP testrunner

Leaked log message from previous Syzkaller reproducer run #21

Open shunghsiyu opened 4 years ago

shunghsiyu commented 4 years ago

Sometimes printk messages do not immediately show up when the reproducer runs, and are only emitted when the next reproducer starts to run.

In the example below, the backtrace for the previous reproducer (PID: 22329 Comm: 7e93129ee310878) shows up in the result of 7e7470f0058e0950706a4fd684c2d86c7b43df31.

{
  "tid": "7e7470f0058e0950706a4fd684c2d86c7b43df31",
  "runs": 1,
  "passed": 0,
  "failed": 1,
  "broken": 0,
  "skipped": 0,
  "warnings": 0,
  "runtime": "14.1485938299447",
  "log": [
    "...omitted...",
    "kL.{$d'reH~_bPm~Uq.1:?_.|[y{gSHeU&Oo{n`r+pr/Q7^U'rDZ4yqrF1umrVE]ZPs)|L@!_9[5pesyzwrap  -d /opt/ltp/testcases/bin -n 7e7470f0058e0950706a4fd684c2d86c7b43ddf31; echo cmd-exit-108-$?",
    "tst_test.c:1245: INFO: Timeout per run is 0h 00m 20s",
    "syzwrap.c:71: INFO: https://syzkaller.appspot.com/bug?id=7e7470f0058e0950706a4fd684c2d86c7b43df31",
    "syzwrap.c:74: INFO: /opt/ltp/testcases/bin/7e7470f0058e0950706a4fd684c2d86c7b43df31",
    "7e93129ee310878a85ad9d6617cf768b635d2aee: cxa_atexit.c:100: __new_exitfn: Assertion `l != NULL' failed.",
    "[  296.036863] 000: traps: 7e93129ee310878[22272] general protection fault ip:7f09e6f95e79 sp:7f09e7768cc8 error:0 in libc-2.26.so[7f09e6f72000+1b2000]",
    "[  296.576930] 000: ------------[ cut here ]------------",
    "[  296.576975] 000: WARNING: CPU: 0 PID: 22329 at fs/super.c:236 destroy_unused_super.part.6+0x120/0x150",
    "[  296.576990] 000: Modules linked in:",
    "[  296.576993] 000: CPU: 0 PID: 22329 Comm: 7e93129ee310878 Not tainted 5.3.18-rt_syzkaller #1 SLE15-SP2 (unreleased)",
    "...omitted..."
  ]
}

This is a hard one to solve because there may be tasks related to the last reproducer lingering around somewhere (e.g. inside work queues).

Some ideas for this issue:

  1. Add a sleep after each reproducer run (slows things down a little, and still cannot reliably avoid the issue)
  2. Reboot after each reproducer run (a bit crazy, as it causes a massive slowdown, but should work reliably)
  3. Use dmesg to flush the printk buffer (not sure whether this will work; see the sketch below)
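Of the three, the dmesg-flush idea is the easiest to prototype. Here is a minimal sketch, in Python for illustration only (runltp-ng itself is not Python, and the helper name is made up): it shells out to dmesg --read-clear between reproducer runs so that whatever printk output has already reached the ring buffer is consumed before the next test starts. It cannot help with messages that are only emitted after the flush, which is exactly the uncertainty noted in idea 3.

import subprocess
import sys

def flush_kernel_log():
    # Read and clear the kernel ring buffer (klogctl READ_CLEAR) so that
    # late printk output from the previous reproducer is consumed here
    # instead of leaking into the next test's log. Needs root/CAP_SYSLOG.
    try:
        out = subprocess.run(["dmesg", "--read-clear"],
                             capture_output=True, text=True, check=True)
        return out.stdout
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not flush kernel log: %s" % exc, file=sys.stderr)
        return ""
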
metan-ucw commented 4 years ago

I doubt that this will ever get fixed; the timing is just too unpredictable, and rebooting or even loading from a snapshot would make the test run way too slow.

I guess the best we can do is to keep the dmesg output in one big file and only store, in each test case result, a line offset pointing to the relevant part.
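A rough sketch of that bookkeeping, assuming the runner appends all console/dmesg output to a single file and records where each test started (the file name and class here are hypothetical, not runltp-ng API):

class ConsoleLog:
    # Keep one big log file and remember, per test, the byte offset at
    # which that test started; results then point into the shared file
    # instead of carrying (possibly leaked) log lines themselves.
    def __init__(self, path="console.log"):
        self.path = path
        self.fh = open(path, "a")
        self.offsets = {}

    def start_test(self, tid):
        # Output written from this offset on *may* belong to tid, including
        # printk messages that leaked from an earlier reproducer.
        self.offsets[tid] = self.fh.tell()

    def write(self, data):
        self.fh.write(data)
        self.fh.flush()

    def result_for(self, tid):
        return {"tid": tid, "log_file": self.path,
                "log_offset": self.offsets[tid]}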

richiejp commented 4 years ago

We could try waiting for the system load to decrease below some threshold before continuing. However, it seems like some of the metrics for measuring load are broken with the low-latency scheduler on my machine, so this could be tricky.
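For reference, a naive version of that wait could look like the following (purely illustrative; it reads /proc/loadavg, which is one of the metrics that may be unreliable here):

import time

def wait_for_idle(threshold=0.5, timeout=60.0, poll=1.0):
    # Block until the 1-minute load average drops below `threshold`,
    # or give up after `timeout` seconds. The 1-minute average reacts
    # slowly, so it only roughly tracks lingering reproducer work.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open("/proc/loadavg") as f:
            load1 = float(f.read().split()[0])
        if load1 < threshold:
            return True
        time.sleep(poll)
    return False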

I guess you can also compare the PIDs in the stack trace with the reproducer's.
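The PID check could be as simple as matching the "PID: ... Comm: ..." field of a WARNING/oops header against the reproducer that is currently running, e.g. (regex and function are only a sketch):

import re

# Matches e.g. "PID: 22329 Comm: 7e93129ee310878" in the log above.
PID_RE = re.compile(r"\bPID:\s*(\d+)\s+Comm:\s*(\S+)")

def trace_belongs_to(line, current_pid, current_comm):
    # True if the backtrace header refers to the reproducer we just ran,
    # False if it names some other task (e.g. a leftover from the previous
    # reproducer), None if the line has no PID/Comm information.
    m = PID_RE.search(line)
    if not m:
        return None
    pid, comm = int(m.group(1)), m.group(2)
    # The kernel truncates comm to 15 characters (TASK_COMM_LEN - 1),
    # so compare only that prefix of the reproducer's name.
    return pid == current_pid and comm == current_comm[:15]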

shunghsiyu commented 4 years ago

I decided to try the crazy idea of rebooting after every single reproducer run, to see how painful vs. useful it is.

Any suggestions for the runltp-ng option name, in case the crazy idea turns out to be useful enough for a PR? (--always-reboot?)

richiejp commented 4 years ago

I'd prefer the more generic --always-revert because of #23