near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

[TestLoop] test doesn't end when run_until times out #11840

Open jancionear opened 1 month ago

jancionear commented 1 month ago

TestLoopV2::run_until runs testloop until the condition is met or it hits the timeout. In case of a timeout, run_until panics and the test should fail.

I've observed that when run_until panics, the test doesn't stop. It just hangs until it reaches the timeout.
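To make the expected behaviour concrete, here is a minimal sketch of the "run until the condition holds, panic on timeout" contract described above. It is not the TestLoopV2 implementation: `MiniLoop`, its fields, and the closure signature are hypothetical names invented for illustration, and the virtual clock is an assumption about how such a loop might track time.

```rust
use std::time::Duration;

/// Hypothetical stand-in for a test loop driven by virtual time.
/// None of these names come from nearcore.
struct MiniLoop {
    now: Duration,       // virtual clock
    tick: Duration,      // virtual time consumed per handled event
    events_handled: u64, // number of events processed so far
}

impl MiniLoop {
    fn new(tick: Duration) -> Self {
        MiniLoop { now: Duration::ZERO, tick, events_handled: 0 }
    }

    /// Handles events until `condition` returns true.
    /// Panics once `timeout` of virtual time has elapsed without that happening.
    fn run_until(&mut self, mut condition: impl FnMut(u64) -> bool, timeout: Duration) {
        let deadline = self.now + timeout;
        loop {
            if condition(self.events_handled) {
                return;
            }
            if self.now >= deadline {
                panic!("condition not met within {:?} of virtual time", timeout);
            }
            // "Handle" one event: advance the virtual clock and the counter.
            self.now += self.tick;
            self.events_handled += 1;
        }
    }
}

fn main() {
    let mut test_loop = MiniLoop::new(Duration::from_millis(10));
    test_loop.run_until(|handled| handled >= 5, Duration::from_secs(1));
    println!("condition met after {} events", test_loop.events_handled);
}
```

With a condition that never holds, `run_until` in this sketch panics once the deadline passes, and a test calling it would normally be reported as failed right away.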

For example, on this branch, where the test is modified to fail inside run_until, the following command hangs:

cargo nextest run -p integration-tests test_client_with_simple_test_loop

Running with the --nocapture flag shows that the panic is triggered, but the test doesn't stop.

Attaching gdb to the hung process shows that it's blocked in a futex wait inside a channel recv_timeout. It looks like the test harness is waiting for the test to complete, but I'm not sure:

(gdb) bt
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x0000564f988098d3 in std::sys::pal::unix::futex::futex_wait () at library/std/src/sys/pal/unix/futex.rs:62
#2  std::sys_common::thread_parking::futex::Parker::park_timeout () at library/std/src/sys_common/thread_parking/futex.rs:72
#3  std::thread::park_timeout () at library/std/src/thread/mod.rs:1165
#4  0x0000564f945d70b4 in std::sync::mpmc::context::Context::wait_until () at library/std/src/sync/mpmc/context.rs:130
#5  std::sync::mpmc::list::{impl#3}::recv::{closure#1}<test::event::CompletedTest> () at library/std/src/sync/mpmc/list.rs:444
#6  0x0000564f945d6be2 in std::sync::mpmc::context::{impl#0}::with::{closure#0}<std::sync::mpmc::list::{impl#3}::recv::{closure_env#1}<test::event::CompletedTest>, ()> ()
    at library/std/src/sync/mpmc/context.rs:50
#7  std::sync::mpmc::context::{impl#0}::with::{closure#1}<std::sync::mpmc::list::{impl#3}::recv::{closure_env#1}<test::event::CompletedTest>, ()> ()
    at library/std/src/sync/mpmc/context.rs:58
#8  std::thread::local::LocalKey::try_with<core::cell::Cell<core::option::Option<std::sync::mpmc::context::Context>>, std::sync::mpmc::context::{impl#0}::with::{closure_env#1}<std::sync::mpmc::list::{impl#3}::recv::{closure_env#1}<test::event::CompletedTest>, ()>, ()> () at library/std/src/thread/local.rs:286
#9  std::sync::mpmc::context::Context::with<std::sync::mpmc::list::{impl#3}::recv::{closure_env#1}<test::event::CompletedTest>, ()> () at library/std/src/sync/mpmc/context.rs:53
#10 std::sync::mpmc::list::Channel::recv<test::event::CompletedTest> () at library/std/src/sync/mpmc/list.rs:434
#11 0x0000564f945eca4f in std::sync::mpmc::Receiver::recv_deadline<test::event::CompletedTest> () at library/std/src/sync/mpmc/mod.rs:340
#12 std::sync::mpmc::Receiver::recv_timeout<test::event::CompletedTest> () at library/std/src/sync/mpmc/mod.rs:323
#13 std::sync::mpsc::Receiver::recv_timeout<test::event::CompletedTest> () at library/std/src/sync/mpsc/mod.rs:909
#14 test::run_tests<test::console::run_tests_console::{closure_env#2}> () at library/test/src/lib.rs:418
#15 test::console::run_tests_console () at library/test/src/console.rs:322
#16 0x0000564f946080e9 in test::test_main () at library/test/src/lib.rs:142
#17 0x0000564f94608ecb in test::test_main_static () at library/test/src/lib.rs:164
#18 0x0000564f9416ce73 in integration_tests::main () at integration-tests/src/lib.rs:1

Maybe panicking causes the event::CompletedTest event to not be sent? It's weird, tests should be allowed to panic :thinking:

/cc @robin-near @shreyan-gupta

jancionear commented 1 month ago

The same thing happens with the standard cargo test, so it's not a nextest issue:

cargo test -p integration-tests test_client_with_simple_test_loop -- --nocapture
<hangs>

shreyan-gupta commented 1 month ago

I think this is a known issue? @robin-near, can you confirm whether this is the same as https://github.com/near/nearcore/issues/11447?

I remember @tayfunelmas might have looked into a similar issue?

tayfunelmas commented 1 month ago

This could be an issue similar to the one addressed in #11653, where some code blocks on a message that never arrives because the panic prevents the message sender from sending it.
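The general mechanism can be reproduced with a plain std channel. The sketch below is an assumption-laden illustration, not code from nearcore or from the test harness: `compute` and `wait_for_result` are hypothetical names. A worker panics before it sends its result. If every sender is dropped during unwinding, the receiver sees a disconnect immediately; if another clone of the sender is still alive elsewhere, the receiver just keeps waiting.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Hypothetical piece of work that always fails.
fn compute() -> u32 {
    panic!("simulated failure before the result is sent");
}

/// Spawns a worker that panics before sending its result, then waits for that result.
/// If `keep_extra_sender` is true, a clone of the sender outlives the worker,
/// so the channel never reports a disconnect.
fn wait_for_result(keep_extra_sender: bool, timeout: Duration) -> Result<u32, RecvTimeoutError> {
    let (tx, rx) = mpsc::channel::<u32>();
    let extra = if keep_extra_sender { Some(tx.clone()) } else { None };

    let worker = thread::spawn(move || {
        let result = compute(); // panics; the send below never runs
        tx.send(result).unwrap();
    });
    // The worker panicked, so join returns Err; its `tx` was dropped while unwinding.
    let _ = worker.join();

    let outcome = rx.recv_timeout(timeout);
    drop(extra); // the extra sender was alive for the whole wait
    outcome
}

fn main() {
    // All senders gone: the receiver learns about it immediately.
    println!("{:?}", wait_for_result(false, Duration::from_millis(100)));
    // A sender is still alive: the receiver can only time out.
    println!("{:?}", wait_for_result(true, Duration::from_millis(100)));
}
```

The first call returns `Err(Disconnected)` and the second `Err(Timeout)`. With a blocking `recv()` instead of `recv_timeout`, the second case would never return, which is the kind of hang described in this thread.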