openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.14k stars 424 forks source link

dcx/test_ucp_peer_failure.zcopy/3 failed on swx-rdmz-ucx-arm-hwi #6203

Open dmitrygx opened 3 years ago

dmitrygx commented 3 years ago
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from dcx/test_ucp_peer_failure
[ RUN      ] dcx/test_ucp_peer_failure.zcopy/3 <dc_x/rma_fail_imm>
[New Thread 0xfffff6d1f140 (LWP 55770)]
[Thread 0xfffff6d1f140 (LWP 55770) exited]
[New Thread 0xfffff6d1f140 (LWP 55778)]
[     INFO ] < request 0xaaaaac408ec0 completed with error Request canceled >
ucp/test_ucp_peer_failure.cc:332: Failure
Value of: status
  Actual: -16
Expected: m_err_status
Which is: -80

the issue was reproduced locally on swx-rdmz-ucx-arm-hwi and during Azure CI testing

dmitrygx commented 3 years ago
[----------] Global test environment set-up.
[----------] 1 test from dcx/test_ucp_peer_failure
[ RUN      ] dcx/test_ucp_peer_failure.zcopy/3 <dc_x/rma_fail_imm>
[New Thread 0xfffff6d1f140 (LWP 53103)]
[Thread 0xfffff6d1f140 (LWP 53103) exited]
[New Thread 0xfffff6d1f140 (LWP 53113)]
[swx-rdmz-ucx-arm-hwi:40125:0:40125] test_ucp_peer_failure.cc:253  Fatal: Unexpected CANCELED status
==== backtrace (tid:  40125) ====
 0  /labhome/dmitrygla/ucx/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x78) [0xfffff7f7e6d8]
 1  /labhome/dmitrygla/ucx/src/ucs/.libs/libucs.so.0(ucs_fatal_error_message+0xf0) [0xfffff7f7b1f4]
 2  /labhome/dmitrygla/ucx/src/ucs/.libs/libucs.so.0(+0x30358) [0xfffff7f7b358]
 3  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x4b2848) [0xaaaaaaf5c848]
 4  /labhome/dmitrygla/ucx/src/ucp/.libs/libucp.so.0(+0x7c70c) [0xfffff7db170c]
 5  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(uct_rc_txqp_purge_outstanding_op+0x268) [0xfffff7b998a8]
 6  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(+0x10e5f8) [0xfffff7c6f5f8]
 7  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(uct_dc_mlx5_ep_handle_failure+0x294) [0xfffff7c715b4]
 8  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(+0x130638) [0xfffff7c91638]
 9  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(+0x1306cc) [0xfffff7c916cc]
10  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x80) [0xfffff7b8f178]
11  /labhome/dmitrygla/ucx/src/uct/ib/.libs/libuct_ib.so.0(+0x1252f4) [0xfffff7c862f4]
12  /labhome/dmitrygla/ucx/src/ucp/.libs/libucp.so.0(+0x499ec) [0xfffff7d7e9ec]
13  /labhome/dmitrygla/ucx/src/ucp/.libs/libucp.so.0(ucp_worker_progress+0x170) [0xfffff7d89e7c]
14  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x60e4c0) [0xaaaaab0b84c0]
15  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x609ca4) [0xaaaaab0b3ca4]
16  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x60a03c) [0xaaaaab0b403c]
17  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x60a1d4) [0xaaaaab0b41d4]
18  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x609d7c) [0xaaaaab0b3d7c]
19  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x4b2cb4) [0xaaaaaaf5ccb4]
20  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x4b3b0c) [0xaaaaaaf5db0c]
21  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2d6b00) [0xaaaaaad80b00]
22  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2d6d30) [0xaaaaaad80d30]
23  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x45a29c) [0xaaaaaaf0429c]
24  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2af114) [0xaaaaaad59114]
25  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2a890c) [0xaaaaaad5290c]
26  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x28d1bc) [0xaaaaaad371bc]
27  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x28da14) [0xaaaaaad37a14]
28  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x28e0c0) [0xaaaaaad380c0]
29  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x294ae0) [0xaaaaaad3eae0]
30  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2b0398) [0xaaaaaad5a398]
31  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2a988c) [0xaaaaaad5388c]
32  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2937a4) [0xaaaaaad3d7a4]
33  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2bc288) [0xaaaaaad66288]
34  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x2bc148) [0xaaaaaad66148]
35  /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xfffff77386e0]
36  /labhome/dmitrygla/ucx/test/gtest/gtest(+0x286d94) [0xaaaaaad30d94]
=================================
[swx-rdmz-ucx-arm-hwi:40125:0:40125] Process frozen...

Thread 414 "gtest" received signal SIGUSR1, User defined signal 1.
[Switching to Thread 0xfffff6d1f140 (LWP 53113)]
0x0000fffff77e8600 in __GI_epoll_pwait (epfd=<optimized out>, events=0xfffff6d1e740, maxevents=<optimized out>, timeout=0, set=0x0) at ../sysdeps/unix/sysv/linux/epoll_pwait.c:42
42      ../sysdeps/unix/sysv/linux/epoll_pwait.c: No such file or directory.
(gdb) t 1
[Switching to thread 1 (Thread 0xfffff7ff19c0 (LWP 40125))]
#0  0x0000fffff7b46878 in __libc_pause () at ../sysdeps/unix/sysv/linux/pause.c:32
32      ../sysdeps/unix/sysv/linux/pause.c: No such file or directory.
(gdb) bt
#0  0x0000fffff7b46878 in __libc_pause () at ../sysdeps/unix/sysv/linux/pause.c:32
#1  0x0000fffff7f7db38 in ucs_debug_freeze () at debug/debug.c:820
#2  0x0000fffff7f7df44 in ucs_error_freeze (message=0xffffffffc660 "Fatal: Unexpected CANCELED status") at debug/debug.c:915
#3  0x0000fffff7f7e6fc in ucs_handle_error (message=0xffffffffc660 "Fatal: Unexpected CANCELED status") at debug/debug.c:1078
#4  0x0000fffff7f7b1f4 in ucs_fatal_error_message (file=0xaaaaab357be8 "ucp/test_ucp_peer_failure.cc", line=253,
    function=0xaaaaab358428 <test_ucp_peer_failure::send_cb(void*, ucs_status_t)::__FUNCTION__> "send_cb", message_buf=0xffffffffc660 "Fatal: Unexpected CANCELED status") at debug/assert.c:37
#5  0x0000fffff7f7b358 in ucs_fatal_error_format (file=0xaaaaab357be8 "ucp/test_ucp_peer_failure.cc", line=253,
    function=0xaaaaab358428 <test_ucp_peer_failure::send_cb(void*, ucs_status_t)::__FUNCTION__> "send_cb", format=0xaaaaab357ca8 "Fatal: Unexpected CANCELED status") at debug/assert.c:53
#6  0x0000aaaaaaf5c848 in test_ucp_peer_failure::send_cb (request=0xaaaaac553000, status=UCS_ERR_CANCELED) at ucp/test_ucp_peer_failure.cc:253
#7  0x0000fffff7db170c in ucp_request_complete_send (status=UCS_ERR_CANCELED, req=0xaaaaac552f00) at /labhome/dmitrygla/ucx/src/ucp/core/ucp_request.inl:167
#8  ucp_rma_request_zcopy_completion (self=0xaaaaac552f88) at rma/rma_send.c:130
#9  0x0000fffff7b998a8 in uct_invoke_completion (status=UCS_ERR_CANCELED, comp=0xaaaaac552f88) at /labhome/dmitrygla/ucx/src/uct/base/uct_iface.h:684
#10 uct_rc_txqp_purge_outstanding_op (iface=0xaaaaac088000, op=0xaaaaac414400, status=UCS_ERR_CANCELED, warn=0) at rc/base/rc_ep.c:453
#11 0x0000fffff7c6f5f8 in uct_dc_mlx5_txqp_purge_outstanding (iface=0xaaaaac088000, txqp=0xaaaaac0909b8, status=UCS_ERR_CANCELED, sn=1, warn=0) at dc/dc_mlx5_ep.c:969
#12 0x0000fffff7c715b4 in uct_dc_mlx5_ep_handle_failure (ep=0xaaaaabfa7b90, arg=0xaaaaac1cb080, ep_status=UCS_ERR_CANCELED) at dc/dc_mlx5_ep.c:1363
#13 0x0000fffff7c91638 in uct_dc_mlx5_dci_handle_failure (iface=0xaaaaac088000, cqe=0xaaaaac1cb080, dci=0 '\000', status=UCS_ERR_CANCELED) at dc/dc_mlx5.c:1065
#14 0x0000fffff7c916cc in uct_dc_mlx5_iface_handle_failure (ib_iface=0xaaaaac088000, arg=0xaaaaac1cb080, status=UCS_ERR_CANCELED) at dc/dc_mlx5.c:1080
#15 0x0000fffff7b8f178 in uct_ib_mlx5_check_completion (iface=0xaaaaac088000, cq=0xaaaaac090730, cqe=0xaaaaac1cb080) at mlx5/ib_mlx5.c:360
#16 0x0000fffff7c862f4 in uct_ib_mlx5_poll_cq (cq=0xaaaaac090730, iface=0xaaaaac088000) at /labhome/dmitrygla/ucx/src/uct/ib/mlx5/ib_mlx5.inl:81
#17 uct_dc_mlx5_poll_tx (iface=0xaaaaac088000) at dc/dc_mlx5.c:212
#18 uct_dc_mlx5_iface_progress (flags=8, arg=0xaaaaac088000) at dc/dc_mlx5.c:253
#19 uct_dc_mlx5_iface_progress_ll (arg=0xaaaaac088000) at dc/dc_mlx5.c:263
#20 0x0000fffff7d7e9ec in ucs_callbackq_dispatch (cbq=0xaaaaabfdd430) at /labhome/dmitrygla/ucx/src/ucs/datastruct/callbackq.h:211
#21 0x0000fffff7d89e7c in uct_worker_progress (worker=0xaaaaabfdd430) at /labhome/dmitrygla/ucx/src/uct/api/uct.h:2436
#22 ucp_worker_progress (worker=0xaaaaac1ba000) at core/ucp_worker.c:2407
#23 0x0000aaaaab0b84c0 in ucp_test_base::entity::progress (this=0xaaaaabfa7eb0, worker_index=0) at ucp/ucp_test.cc:870
#24 0x0000aaaaab0b3ca4 in ucp_test::progress (this=0xaaaaac40db10, worker_index=0) at ucp/ucp_test.cc:155
#25 0x0000aaaaab0b403c in ucp_test::request_process (this=0xaaaaac40db10, req=0xaaaaac552ec0, worker_index=0, wait=true) at ucp/ucp_test.cc:217
#26 0x0000aaaaab0b41d4 in ucp_test::request_wait (this=0xaaaaac40db10, req=0xaaaaac552ec0, worker_index=0) at ucp/ucp_test.cc:240
#27 0x0000aaaaab0b3d7c in ucp_test::flush_ep (this=0xaaaaac40db10, e=..., worker_index=0, ep_index=1) at ucp/ucp_test.cc:171
#28 0x0000aaaaaaf5ccb4 in test_ucp_peer_failure::do_test (this=0xaaaaac40db10, msg_size=1024, pre_msg_count=0, force_close=false, request_must_fail=true) at ucp/test_ucp_peer_failure.cc:321
#29 0x0000aaaaaaf5db0c in test_ucp_peer_failure_zcopy_Test::test_body (this=0xaaaaac40db10) at ucp/test_ucp_peer_failure.cc:419
#30 0x0000aaaaaad80b00 in ucs::test_base::run (this=0xaaaaac40db10) at common/test.cc:347
#31 0x0000aaaaaad80d30 in ucs::test_base::TestBodyProxy (this=0xaaaaac40db10) at common/test.cc:373
#32 0x0000aaaaaaf0429c in ucp_test::TestBody (this=0xaaaaac40db10) at ucp/ucp_test.h:190
#33 0x0000aaaaaad59114 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0xaaaaac40db90, method=&virtual testing::Test::TestBody(),
    location=0xaaaaab3311f0 "the test body") at common/gtest-all.cc:3562
#34 0x0000aaaaaad5290c in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0xaaaaac40db90, method=&virtual testing::Test::TestBody(),
    location=0xaaaaab3311f0 "the test body") at common/gtest-all.cc:3598
#35 0x0000aaaaaad371bc in testing::Test::Run (this=0xaaaaac40db90) at common/gtest-all.cc:3634
#36 0x0000aaaaaad37a14 in testing::TestInfo::Run (this=0xaaaaab8ea8f0) at common/gtest-all.cc:3812
#37 0x0000aaaaaad380c0 in testing::TestCase::Run (this=0xaaaaab8dfb90) at common/gtest-all.cc:3930
#38 0x0000aaaaaad3eae0 in testing::internal::UnitTestImpl::RunAllTests (this=0xaaaaab6a6f40) at common/gtest-all.cc:5808
#39 0x0000aaaaaad5a398 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xaaaaab6a6f40,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0xaaaaaad3e7c4 <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0xaaaaab331ad8 "auxiliary test code (environments or event listeners)") at common/gtest-all.cc:3562
#40 0x0000aaaaaad5388c in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xaaaaab6a6f40,
    method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0xaaaaaad3e7c4 <testing::internal::UnitTestImpl::RunAllTests()>,
    location=0xaaaaab331ad8 "auxiliary test code (environments or event listeners)") at common/gtest-all.cc:3598
#41 0x0000aaaaaad3d7a4 in testing::UnitTest::Run (this=0xaaaaab68f530 <testing::UnitTest::GetInstance()::instance>) at common/gtest-all.cc:5419
#42 0x0000aaaaaad66288 in RUN_ALL_TESTS () at common/gtest.h:20059
#43 0x0000aaaaaad66148 in main (argc=1, argv=0xffffffffee48) at common/main.cc:102
evgeny-leksikov commented 3 years ago

Reproduced again in https://github.com/openucx/ucx/pull/6262

21-03-11T08:21:25.8513755Z [ RUN      ] dcx/test_ucp_peer_failure.zcopy/2 <dc_x/tag_fail_imm>
2021-03-11T08:22:27.8833028Z /scrap/azure/agent-03/AZP_WORKSPACE/1/s/contrib/../test/gtest/ucp/test_ucp_peer_failure.cc:334: Failure
2021-03-11T08:22:27.8835410Z Value of: (m_err_status == status) || (UCS_ERR_CANCELED == status)
2021-03-11T08:22:27.8836787Z   Actual: false
2021-03-11T08:22:27.8837939Z Expected: true
2021-03-11T08:22:28.0697442Z [  FAILED  ] dcx/test_ucp_peer_failure.zcopy/2, where GetParam() = dc_x/tag_fail_imm (62212 ms)
2021-03-11T08:22:28.0708593Z [----------] 6 tests from dcx/test_ucp_peer_failure (71318 ms total)

https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/14541/logs/266

dmitrygx commented 3 years ago

@evgeny-leksikov it seems to fail with another status

evgeny-leksikov commented 3 years ago

Yep