openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.12k stars 419 forks source link

Hang in dcx/test_ucp_peer_failure_2pairs.status_after_error/0 #2537

Open shamisp opened 6 years ago

shamisp commented 6 years ago
$ make -C test/gtest/ GTEST_FILTER="*test_ucp_peer_failure_2pairs.status_after_error*" test

RHEL7.5, OFED-internal-4.3-1.0.1, UCX master

dc/test_ucp_peer_failure_2pairs.status_after_error/0 test runs fine.

shamisp commented 6 years ago

@gmegan - FYI

yosefe commented 6 years ago

we saw this happen because of timeout on 2nd pair, so smoke_test wait-for-recv does not complete. @shamisp do you see smoke_test() in the backtrace, if attaching with gdb?

shamisp commented 6 years ago

@gmegan can you please double check ?

gmegan commented 6 years ago

Smoke test appears in the backtrace. For this test, it hung on udx/test_ucp_peer_failure_2pairs.status_after_error/0


gtest --gtest_filter=test_ucp_peer_failure_2pairs.status_after_error [ INFO ] checking dc: no [ INFO ] checking dcx: no [ INFO ] checking ud: yes [ INFO ] checking udx: yes [ INFO ] checking rc: yes [ INFO ] checking rcx: yes [ INFO ] checking shm_ib: yes [ INFO ] checking ugni: no [ INFO ] checking self: yes [ INFO ] checking tcp: yes [1525358073.161028] [vulcan2:36017:0] ucp_context.c:557 UCX WARN transport '\dc' is not available [ INFO ] checking all_rcdc: yes [ INFO ] checking all: yes [ INFO ] checking rc_x: yes [ INFO ] checking ud_mlx5: yes [ INFO ] checking shm: yes [1525358073.297950] [vulcan2:36017:0] ucp_context.c:557 UCX WARN transport 'rdmacm' is not available [ INFO ] checking mm_rdmacm: yes [ INFO ] Using random seed of 7673 Note: Google Test filter = test_ucp_peer_failure_2pairs.status_after_error [==========] Running 7 tests from 7 test cases. [----------] Global test environment set-up. [----------] 1 test from ud/test_ucp_peer_failure_2pairs [ RUN ] ud/test_ucp_peer_failure_2pairs.status_after_error/0 [ OK ] ud/test_ucp_peer_failure_2pairs.status_after_error/0 (1138 ms) [----------] 1 test from ud/test_ucp_peer_failure_2pairs (1138 ms total)

[----------] 1 test from udx/test_ucp_peer_failure_2pairs [ RUN ] udx/test_ucp_peer_failure_2pairs.status_after_error/0


(gdb) backtrace

0 0x0000ffffb1722fb4 in uct_ib_mlx5_poll_cq (cq=0x65dbea8, iface=0x65d9780)

at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/ib/mlx5/ib_mlx5.inl:31

1 uct_ud_mlx5_iface_poll_rx (is_async=0, iface=0x65d9780) at ../../../src/uct/ib/ud/accel/ud_mlx5.c:389

2 uct_ud_mlx5_iface_progress (tl_iface=0x65d9780) at ../../../src/uct/ib/ud/accel/ud_mlx5.c:460

3 0x0000ffffb15fc0b4 in ucs_callbackq_dispatch (cbq=)

at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/ucs/datastruct/callbackq.h:208

4 uct_worker_progress (worker=)

at /home/meggro01/local/src/maas-tools/shmem-setup/BUILD/ucx/BUILD/../src/uct/api/uct.h:1650

5 ucp_worker_progress (worker=0x661d1a0) at ../../../src/ucp/core/ucp_worker.c:1356

6 0x0000000000588c54 in progress (this=0x656d360) at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:546

7 test_ucp_peer_failure_2pairs::wait_req (this=this@entry=0x656d360, req=req@entry=0x65f4e20)

at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:571

8 0x0000000000589184 in wait_req (req=0x65f4e20, this=0x656d360) at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:598

9 test_ucp_peer_failure_2pairs::smoke_test (this=this@entry=0x656d360, idx=idx@entry=0)

at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:597

10 0x0000000000590530 in test_ucp_peer_failure_2pairs::init (this=0x656d360) at ../../../test/gtest/ucp/test_ucp_peer_failure.cc:510

11 0x000000000047a8c8 in ucs::test_base::SetUpProxy (this=0x656d360) at ../../../test/gtest/common/test.cc:229

12 0x000000000046e06c in HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0x79f5b8 "SetUp()",

method=<optimized out>, object=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3562

13 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x656d3c0, method=,

location=0x79f5b8 "SetUp()") at ../../../test/gtest/common/gtest-all.cc:3598

14 0x00000000004662e4 in testing::Test::Run (this=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3630

15 0x00000000004666ac in Run (this=0x656d3c0) at ../../../test/gtest/common/gtest-all.cc:3626

16 testing::TestInfo::Run (this=0x62cdd90) at ../../../test/gtest/common/gtest-all.cc:3812

17 0x000000000046695c in Run (this=) at ../../../test/gtest/common/gtest-all.cc:3940

18 testing::TestCase::Run (this=0x62cdf00) at ../../../test/gtest/common/gtest-all.cc:3930

19 0x0000000000467700 in Run (this=) at ../../../test/gtest/common/gtest-all.cc:3916

20 testing::internal::UnitTestImpl::RunAllTests (this=this@entry=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:5802

21 0x00000000004679c4 in RunAllTests (this=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:5719

22 HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (

location=0x79f720 "auxiliary test code (environments or event listeners)", method=<optimized out>, object=0x61c2e30)
at ../../../test/gtest/common/gtest-all.cc:3562

23 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (

location=0x79f720 "auxiliary test code (environments or event listeners)", 
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x467908 <testing::internal::UnitTestImpl::RunAllTests()>, object=0x61c2e30) at ../../../test/gtest/common/gtest-all.cc:3598

24 testing::UnitTest::Run (this=) at ../../../test/gtest/common/gtest-all.cc:5413

25 0x0000000000414c64 in RUN_ALL_TESTS () at ../../../test/gtest/common/gtest.h:20059

26 main (argc=, argv=0xffffd1ee0128) at ../../../test/gtest/common/main.cc:85

(gdb)

yosefe commented 6 years ago

@shamisp @gmegan can you pls check if it hangs on "udx/test_ucp_peer_failure_2pairs.status_after_error/0"? i could reproduce and fix udx/test_ucp_peer_failure_2pairs.status_after_error/0 only for now

yosefe commented 5 years ago

Test hangs with valgrind: http://hpc-master.lab.mtl.com:8080/job/hpc-ucx-pr/10817/label=hpc-test-node,worker=1/console

20:04:15 [ RUN      ] dcx/test_ucp_peer_failure.zcopy/0
20:19:15 /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/test_helpers.cc:46: Failure
20:19:15 Failed
20:19:15 Connection timed out - abort testing
20:19:15 [hpc-test-node:23748:0:23748] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
20:21:40 ==== backtrace (tid:  23748) ====
20:21:40  0 0x00000000000dce45 __GI___sched_yield()  :0
20:21:40  1 0x000000000078665d ucp_test::progress()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/ucp/ucp_test.cc:148
20:21:40  2 0x000000000069fde2 test_ucp_peer_failure::do_test()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/ucp/test_ucp_peer_failure.cc:306
20:21:40  3 0x0000000000560066 ucs::test_base::run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/test.cc:276
20:21:40  4 0x000000000054aba3 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3562
20:21:40  5 0x000000000053f01d testing::Test::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3635
20:21:40  6 0x000000000053f0ec testing::TestInfo::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3812
20:21:40  7 0x000000000053f24f testing::TestCase::Run()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:3930
20:21:40  8 0x0000000000543be7 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:5802
20:21:40  9 0x0000000000543eeb testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest-all.cc:5719
20:21:40 10 0x00000000004eb5b3 main()  /scrap/jenkins/workspace/hpc-ucx-pr-6/label/hpc-test-node/worker/1/contrib/../test/gtest/common/gtest.h:20059
20:21:40 11 0x0000000000021c05 __libc_start_main()  ???:0
20:21:40 12 0x00000000005297a4 _start()  ???:0
20:21:40 =================================
20:21:40 Sending notification to yosefe@mellanox.com
20:23:57 [hpc-test-node:23748:0:23748] Process frozen...