slow_test node.fork_storm:telemetry.under_load fails every time

dsiganos commented 2 years ago

Node version: v23.0

Note: incrase the file limit before running the test: ulimit -n 10000

ds@tux:~/nano/releases/V23.0/build$ ulimit -n 10000
ds@tux:~/nano/releases/V23.0/build$ ./slow_test --gtest_filter=node.fork_storm:telemetry.under_load
Note: Google Test filter = node.fork_storm:telemetry.under_load
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from node
[ RUN      ] node.fork_storm
Assertion (iterations1 < 10000) failed
std::shared_ptr<nano::node> nano::system::add_node(const nano::node_config&, nano::node_flags, nano::transport::transport_type)
/home/ds/nano/releases/V23.0/nano-node/nano/test_common/system.cpp:93

 0# nano::generate_stacktrace[abi:cxx11]() at /home/ds/nano/releases/V23.0/nano-node/nano/lib/utility.cpp:129
 1# assert_internal(char const*, char const*, char const*, unsigned int, bool, std::basic_string_view<char, std::char_traits<char> >) at /home/ds/nano/releases/V23.0/nano-node/nano/lib/utility.cpp:175
 2# nano::system::add_node(nano::node_config const&, nano::node_flags, nano::transport::transport_type) at /home/ds/nano/releases/V23.0/nano-node/nano/test_common/system.cpp:93
 3# nano::system::system(unsigned short, nano::transport::transport_type, nano::node_flags) at /home/ds/nano/releases/V23.0/nano-node/nano/test_common/system.cpp:127
 4# node_fork_storm_Test::TestBody() at /home/ds/nano/releases/V23.0/nano-node/nano/slow_test/node.cpp:195
 5# void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:2607
 6# void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:2643
 7# testing::Test::Run() at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:2689
 8# testing::TestInfo::Run() at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:2864
 9# testing::TestSuite::Run() at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:3016
10# testing::internal::UnitTestImpl::RunAllTests() at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:5856
11# bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:2609
12# bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:2643
13# testing::UnitTest::Run() at /home/ds/nano/releases/V23.0/nano-node/gtest/googletest/src/gtest.cc:5441
14# RUN_ALL_TESTS() in ./slow_test
15# main at /home/ds/nano/releases/V23.0/nano-node/nano/slow_test/entry.cpp:15
16# __libc_start_main in /lib/x86_64-linux-gnu/libc.so.6
17# _start in ./slow_test

Aborted

Justin-Randall commented 2 years ago

The poll() during the test is pretty suspect. It is effectively a timeout failure set to 500000 milliseconds.

while (std::any_of (begin, nodes.end (), [] (std::shared_ptr<nano::node> const & node_a) { return node_a->bootstrap_initiator.in_progress (); }))
{
    poll ();
    ++iterations1;
    debug_assert (iterations1 < 10000);
}

Not sure why this test spins up 64 nodes when anything > NUM_CPUs should suffice.

TEST (node, fork_storm)
{
    nano::node_flags flags;
    flags.disable_max_peers_per_ip = true;
    nano::system system (64, nano::transport::transport_type::tcp, flags);
...

Once you are past that, the rest of the test intermittently loops forever when this evaluates to true:

again = empty != 0 || single != 0;

in cases I have seen, empty is the number of nodes allocated at the beginning of the test and single == 0.

Empty: 16 single: 0
Empty: 16 single: 0
Empty: 16 single: 0
Empty: 16 single: 0

Where I intentionally reduced the node allocation count locally to improve iteration times.

Or:

Empty: 15 single: 1
Empty: 15 single: 1
Empty: 15 single: 1
Empty: 15 single: 1

Forever....

When that doesn't happen, sometimes it simply dumps core during the test.

The lack of deterministic behavior in this test really smells funny, meaning there may be an underlying problem it has uncovered that needs attention.

qwahzi commented 1 year ago

@dsiganos - Here is another issue that looks like it can be closed due to a merged PR

dsiganos commented 1 year ago

These tests are still failing.

nanocurrency / nano-node

slow_test node.fork_storm:telemetry.under_load fails every time #3709