Open shamisp opened 6 years ago
@gmegan - FYI
We saw this happen because of a timeout on the 2nd pair, so the smoke_test wait-for-recv does not complete. @shamisp, do you see smoke_test() in the backtrace when attaching with gdb?
@gmegan can you please double-check?
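For illustration, a wait-for-recv that never completes after a timeout on one pair is the kind of hang a bounded wait can turn into a reportable failure. Below is a minimal C++ sketch, assuming a hypothetical helper named wait_until. It is not part of UCX or its gtest suite, and its progress and done callbacks are stand-ins for whatever the test uses to drive communication and check for the received message.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not a UCX or gtest API): repeatedly calls `progress`
// until `done` returns true or `timeout` elapses.
// Returns true if the condition was met, false if the deadline passed first.
inline bool wait_until(const std::function<void()>& progress,
                       const std::function<bool()>& done,
                       std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!done()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;           // give up instead of spinning forever
        }
        progress();                 // stand-in for driving the communication layer
        std::this_thread::yield();  // let other threads run between polls
    }
    return true;
}
```

A caller could treat a false return as a test failure and report which pair timed out, rather than leaving the process stuck in the progress loop.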
Smoke test appears in the backtrace. For this test, it hung on udx/test_ucp_peer_failure_2pairs.status_after_error/0
gtest --gtest_filter=test_ucp_peer_failure_2pairs.status_after_error
[ INFO ] checking dc: no
[ INFO ] checking dcx: no
[ INFO ] checking ud: yes
[ INFO ] checking udx: yes
[ INFO ] checking rc: yes
[ INFO ] checking rcx: yes
[ INFO ] checking shm_ib: yes
[ INFO ] checking ugni: no
[ INFO ] checking self: yes
[ INFO ] checking tcp: yes
[1525358073.161028] [vulcan2:36017:0] ucp_context.c:557 UCX WARN transport '\dc' is not available
[ INFO ] checking all_rcdc: yes
[ INFO ] checking all: yes
[ INFO ] checking rc_x: yes
[ INFO ] checking ud_mlx5: yes
[ INFO ] checking shm: yes
[1525358073.297950] [vulcan2:36017:0] ucp_context.c:557 UCX WARN transport 'rdmacm' is not available
[ INFO ] checking mm_rdmacm: yes
[ INFO ] Using random seed of 7673
Note: Google Test filter = test_ucp_peer_failure_2pairs.status_after_error
[==========] Running 7 tests from 7 test cases.
[----------] Global test environment set-up.
[----------] 1 test from ud/test_ucp_peer_failure_2pairs
[ RUN ] ud/test_ucp_peer_failure_2pairs.status_after_error/0
[ OK ] ud/test_ucp_peer_failure_2pairs.status_after_error/0 (1138 ms)
[----------] 1 test from ud/test_ucp_peer_failure_2pairs (1138 ms total)
[----------] 1 test from udx/test_ucp_peer_failure_2pairs
[ RUN ] udx/test_ucp_peer_failure_2pairs.status_after_error/0
(gdb) backtrace
at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/ib/mlx5/ib_mlx5.inl:31
at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:208
at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/api/uct.h:1650
at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:571
at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:597
method=<optimized out>, object=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3562
location=0x79f5b8 "SetUp()") at ../../../test/gtest/common/gtest-all.cc:3598
location=0x79f720 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x61c2e30)
at ../../../test/gtest/common/gtest-all.cc:3562
location=0x79f720 "auxiliary test code (environments or event listeners)",
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x467908 <testing::internal::UnitTestImpl::RunAllTests()>, object=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:3598
(gdb)
@shamisp @gmegan can you please check whether it hangs on "udx/test_ucp_peer_failure_2pairs.status_after_error/0"? So far I could reproduce and fix only udx/test_ucp_peer_failure_2pairs.status_after_error/0.
Test hangs with valgrind: http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/10817/label=hpc-test-node,worker=1/console
20:04:15 [ RUN ] dcx/test_ucp_peer_failure.zcopy/0
20:19:15 /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/test_helpers.cc:46: Failure
20:19:15 Failed
20:19:15 Connection timed out - abort testing
20:19:15 [hpc-test-node:23748:0:23748] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
20:21:40 ==== backtrace (tid: 23748) ====
20:21:40 0 0x00000000000dce45 __GI___sched_yield() :0
20:21:40 1 0x000000000078665d ucp_test::progress() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/ucp/ucp_test.cc:148
20:21:40 2 0x000000000069fde2 test_ucp_peer_failure::do_test() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/ucp/test_ucp_peer_failure.cc:306
20:21:40 3 0x0000000000560066 ucs::test_base::run() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/test.cc:276
20:21:40 4 0x000000000054aba3 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3562
20:21:40 5 0x000000000053f01d testing::Test::Run() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3635
20:21:40 6 0x000000000053f0ec testing::TestInfo::Run() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3812
20:21:40 7 0x000000000053f24f testing::TestCase::Run() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3930
20:21:40 8 0x0000000000543be7 testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:5802
20:21:40 9 0x0000000000543eeb testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:5719
20:21:40 10 0x00000000004eb5b3 main() /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest.h:20059
20:21:40 11 0x0000000000021c05 __libc_start_main() ???:0
20:21:40 12 0x00000000005297a4 _start() ???:0
20:21:40 =================================
20:21:40 Sending notification to yosefe@mellanox.com
20:23:57 [hpc-test-node:23748:0:23748] Process frozen...
RHEL7.5, OFED-internal-4.3-1.0.1, UCX master
dc/test_ucp_peer_failure_2pairs.status_after_error/0 test runs fine.