Test process crashes with SIGSEGV after it appears to pass

weiznich commented 1 month ago

The diesel project sometimes encounter a sigsegv (possibly in nextest?) after a test is marked as passing by the built-in test runtime.

See the example test output:

 --- STDOUT:              diesel r2d2::tests::check_pool_does_actually_hold_connections ---

running 1 test
test r2d2::tests::check_pool_does_actually_hold_connections ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 52 filtered out; finished in 0.00s

--- STDERR:              diesel r2d2::tests::check_pool_does_actually_hold_connections ---
thread 'r2d2::tests::check_pool_does_actually_hold_connections' panicked at diesel/src/r2d2.rs:626:13:
explicit panic
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

example test run from diesel

(The panic output is expected as the test tests something around unwinding and panic handling)

Nextest version: cargo-nextest-nextest 0.9.78 Platform: Ubuntu (Github actions ubuntu-latest environment)

sunshowers commented 1 month ago

Thanks. That looks like your test segfaulting, not nextest.

weiznich commented 1 month ago

Thanks for the fast response. I think a segfault in that test is unlikely because:

the test itself finishes as indicated by the output. That means the segfault must happen after any test code in the tear down code of the test runtime. At that point all destructors and any other code related to our tests should have already finished
we run the same test in the same environment with Asan enabled, but without nextest. The test passes there without any issue.

I could be wrong about that and it's something different but at least for me these points indicate that the issue is somewhere else.

sunshowers commented 1 month ago

That means the segfault must happen after any test code in the tear down code of the test runtime. At that point all destructors and any other code related to our tests should have already finished

Interesting. Really strange then — but it's definitely not nextest that's segfaulting here, nextest seems to really believe the test is segfaulting. (The error indicates that the test process exited with a SEGV).

Could you try running in CI with NEXTEST_DOUBLE_SPAWN=0? When nextest spawns tests, it first spawns a copy of itself, which then execs the test. Maybe something's going wrong there. But note that this spawned copy execs the test (ie it replaces the process image) so I'm not sure what could be happening there.

I think this may require some debugging work on your end.

weiznich commented 1 month ago

I'm sorry to coming back here again, but I still believe the issue must be in nextest. I've added the NEXTEST_DOUBLE_SPAWN=0 environment variable as suggested. Additionally I've also added a run of the test via the default test runner cargo test before running it via nextest and it passes there. See here for the test run and here for the changes to the CI config. At least for me that all points to nextest as possible source of this problem, even if that sounds impossible for you.

sunshowers commented 1 month ago

Well, I just tried this locally on Ubuntu 22.04 and also saw a segfault:

% cargo test -p diesel --features "mysql r2d2" r2d2::tests::check_pool_does_actually_hold_connections
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.10s
     Running unittests src/lib.rs (target/debug/deps/diesel-e68fa223a51fe070)

running 1 test
test r2d2::tests::check_pool_does_actually_hold_connections ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 33 filtered out; finished in 0.04s

error: test failed, to rerun pass `-p diesel --lib`

Caused by:
  process didn't exit successfully: `/home/rain/dev/diesel/target/debug/deps/diesel-e68fa223a51fe070 'r2d2::tests::check_pool_does_actually_hold_connections'` (signal: 11, SIGSEGV: invalid memory reference)

Here's t a a bt on the core file in gdb (I seem to be missing some symbols): https://gist.github.com/sunshowers/6a68cd5e0d4537e1c69b84f0374b7668

Looks like the segfault is in OpenSSL.

I think you might have to roll up your sleeves and do some core file analysis :)

weiznich commented 1 month ago

Thanks for testing and providing the backtrace. I've tried to reproduce this locally in a docker container with ubuntu 22.04. The problem does not occure there as well. Could you share your the exact versions of the installed openssl and libmysqlclient package?

sunshowers commented 1 month ago

For OpenSSL, Ubuntu says:

libssl-dev/jammy-security,jammy-updates,now 3.0.2-0ubuntu1.18 amd64 [installed]
  Secure Sockets Layer toolkit - development files

pkg-config openssl --debug says: https://gist.github.com/sunshowers/9cb8e144d3c9f7755daec4adb59cf6b2

For mysqlclient:

libmysqlclient-dev/jammy-security,jammy-updates,now 8.0.39-0ubuntu0.22.04.1 amd64 [installed]
  MySQL database development files

libmysqlclient21/jammy-security,jammy-updates,now 8.0.39-0ubuntu0.22.04.1 amd64 [installed,automatic]
  MySQL database client library

pkg-config mysqlclient --debug: https://gist.github.com/sunshowers/3d412be429d4291226611f43cfe7a18a

I'd be curious to hear what you find out. Good luck!

sunshowers commented 1 month ago

Interestingly, the segfault is nondeterministic for me. With 5 runs of cargo test -p diesel --features "mysql r2d2" r2d2::tests::check_pool_does_actually_hold_connections 2 of them segfaulted and 3 passed. Possibly some kind of destructor race?

weiznich commented 1 month ago

It seems like a sleep at the end of the tests works around that problem. https://github.com/diesel-rs/diesel/pull/4269/commits/33ff9ee6062d2ef454853b8ba77524f49cc4e148

I'm still not sure what exactly is going on there and if that's a bug in diesel or somewhere else. From looking at the test we end up with a pool that dropped all connection as we mark the last one as broken due to the panic. That means it will try to get a new connection in the background and we stop? in the middle of that?, maybe that's somehow problematic due to the openssl internals?

Anyway, it's now clear that it is not an nextest issue, although I think the reporting could be clearer here so that it is more obvious that the test binary segfaulted after it outputs that the test was successful.

nextest-rs / nextest

Test process crashes with SIGSEGV after it appears to pass #1727