openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.16k stars 427 forks source link

Segfault in UCX master gtest on Bluefield-2 #6779

Open shamisp opened 3 years ago

shamisp commented 3 years ago

Describe the bug

Segfault in gtest

[----------] 2 tests from cma/test_ucp_peer_failure_rma_zcopy
[ RUN      ] cma/test_ucp_peer_failure_rma_zcopy.put/0 <cma/memory>
[       OK ] cma/test_ucp_peer_failure_rma_zcopy.put/0 (0 ms)
[ RUN      ] cma/test_ucp_peer_failure_rma_zcopy.get/0 <cma/memory>
[       OK ] cma/test_ucp_peer_failure_rma_zcopy.get/0 (0 ms)
[----------] 2 tests from cma/test_ucp_peer_failure_rma_zcopy (0 ms total)

[----------] 44 tests from rc_mlx5/test_tag
[ RUN      ] rc_mlx5/test_tag.tag_eager_short_expected/0 <rc_mlx5/mlx5_0:1>
[localhost:28382:0:28382] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
[localhost:28382:0:28382] Process frozen, press Enter to attach a debugger...

Steps to Reproduce

I was able consistently reproduce the problem with gtest


- make -c test/gtest test
- ucx master (ff33f4304b3be98001e901b1a988faf2ae691be9)
- `../configure --prefix /home/ubuntu/ucx_install --enable-gtest`

### Setup and versions
- Bluefield-2 with `DOCA_v1.0_BlueField_OS_Ubuntu_20.04-5.3-1.0.0.0-3.6.0.11699-1-aarch64`
- Port 1 wired to port 2. Both ports are in IB mode.
shamisp commented 3 years ago

Adding backtrace:

Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000ffff99ab1dec in __GI___wait4 (pid=pid@entry=1324555, stat_loc=stat_loc@entry=0xffff9974e284, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
(gdb) bt
#0  0x0000ffff99ab1dec in __GI___wait4 (pid=pid@entry=1324555, stat_loc=stat_loc@entry=0xffff9974e284, options=options@entry=0, usage=usage@entry=0x0)
    at ../sysdeps/unix/sysv/linux/wait4.c:27
#1  0x0000ffff99ab1d58 in __GI___waitpid (pid=pid@entry=1324555, stat_loc=stat_loc@entry=0xffff9974e284, options=options@entry=0) at waitpid.c:38
#2  0x0000ffff9a14578c in ucs_debugger_attach () at ../../../src/ucs/debug/debug.c:813
#3  0x0000ffff9a14861c in ucs_error_freeze (message=0xffff9a1611b0 "address not mapped to object") at ../../../src/ucs/debug/debug.c:908
#4  ucs_handle_error (message=0xffff9a1611b0 "address not mapped to object") at ../../../src/ucs/debug/debug.c:1078
#5  ucs_handle_error (message=0xffff9a1611b0 "address not mapped to object") at ../../../src/ucs/debug/debug.c:1066
#6  0x0000ffff9a14876c in ucs_debug_handle_error_signal (signo=signo@entry=11, cause=0xffff9a1611b0 "address not mapped to object", fmt=fmt@entry=0xffff9a1612a0 " at address %p")
    at ../../../src/ucs/debug/debug.c:1027
#7  0x0000ffff9a148b14 in ucs_error_signal_handler (signo=11, info=0xffff9974e7a0, context=<optimized out>) at ../../../src/ucs/debug/debug.c:1063
#8  <signal handler called>
#9  0x0000aaaad2eba5b8 in test_tag::unexp_eager (arg=0xaaaafedb2450, data=0xaaab28bea950, length=220, flags=3, stag=11, imm=0, context=<optimized out>)
    at ../../../test/gtest/uct/test_tag.cc:392
#10 0x0000ffff99f097ec in uct_rc_mlx5_iface_tag_handle_unexp (poll_flags=3, byte_len=236, cqe=<optimized out>, iface=0xaaab1b7c4fe0) at ../../../../src/uct/ib/rc/accel/rc_mlx5.inl:1298
#11 uct_rc_mlx5_iface_common_poll_rx (poll_flags=3, iface=0xaaab1b7c4fe0) at ../../../../src/uct/ib/rc/accel/rc_mlx5.inl:1462
#12 uct_rc_mlx5_iface_progress (flags=3, arg=0xaaab1b7c4fe0) at ../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:173
#13 uct_rc_mlx5_iface_progress_tm (arg=0xaaab1b7c4fe0) at ../../../../src/uct/ib/rc/accel/rc_mlx5_iface.c:194
#14 0x0000aaaad2e7e788 in ucs_callbackq_dispatch (cbq=<optimized out>) at /home/ubuntu/ucx/build/../src/ucs/datastruct/callbackq.h:211
#15 uct_worker_progress (worker=<optimized out>) at /home/ubuntu/ucx/build/../src/uct/api/uct.h:2592
#16 uct_test::entity::progress (this=0xaaab0163bfa0) at ../../../test/gtest/uct/uct_test.cc:1025
#17 uct_test::progress (this=0xaaaafedb2450) at ../../../test/gtest/uct/uct_test.cc:753
#18 uct_test::short_progress_loop (this=0xaaaafedb2450, delay_ms=1) at ../../../test/gtest/uct/uct_test.cc:780
#19 0x0000aaaad2ebf7dc in uct_test::wait_for_flag<bool> (timeout=30, flag=0xffffcc824929, this=0xaaaafedb2450) at /home/ubuntu/ucx/build/../src/ucs/time/time.h:75
#20 test_tag::test_tag_expected (this=0xaaaafedb2450, sfunc=<optimized out>, length=<optimized out>, is_sw_rndv=false) at ../../../test/gtest/uct/test_tag.cc:245
#21 0x0000aaaad2d69474 in ucs::test_base::run (this=0xaaaafedb2468) at ../../../test/gtest/common/test.cc:365
#22 ucs::test_base::run (this=0xaaaafedb2468) at ../../../test/gtest/common/test.cc:362
#23 0x0000aaaad2d6a98c in ucs::test_base::TestBodyProxy (this=0xaaaafedb2468) at ../../../test/gtest/common/test.cc:391
#24 0x0000aaaad2d50d38 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (location=0xaaaad32e3860 "the test body", method=<optimized out>, 
    object=0xaaaafedb2450) at ../../../test/gtest/common/gtest-all.cc:3543
#25 testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=object@entry=0xaaaafedb2450, method=<optimized out>, 
    location=location@entry=0xaaaad32e3860 "the test body") at ../../../test/gtest/common/gtest-all.cc:3598
#26 0x0000aaaad2d4702c in testing::Test::Run (this=this@entry=0xaaaafedb2450) at ../../../test/gtest/common/gtest-all.cc:3634
#27 0x0000aaaad2d473b8 in testing::Test::Run (this=0xaaaafedb2450) at ../../../test/gtest/common/gtest-all.cc:3812
#28 testing::TestInfo::Run (this=0xaaaafe111f10) at ../../../test/gtest/common/gtest-all.cc:3812
#29 0x0000aaaad2d475b0 in testing::TestInfo::Run (this=<optimized out>) at ../../../test/gtest/common/gtest-all.cc:3787
#30 testing::TestCase::Run (this=0xaaaafe1120a0) at ../../../test/gtest/common/gtest-all.cc:3930
#31 0x0000aaaad2d48508 in testing::TestCase::Run (this=<optimized out>) at ../../../test/gtest/common/gtest-all.cc:3916
#32 testing::internal::UnitTestImpl::RunAllTests (this=this@entry=0xaaaafdf6da00) at ../../../test/gtest/common/gtest-all.cc:5808
#33 0x0000aaaad2d487d0 in testing::internal::UnitTestImpl::RunAllTests (this=0xaaaafdf6da00) at ../../../test/gtest/common/gtest-all.cc:5725
#34 testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0xaaaad32e39e0 "auxiliary test code (environments or event listeners)", 
    method=<optimized out>, object=0xaaaafdf6da00) at ../../../test/gtest/common/gtest-all.cc:3562
#35 testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (location=0xaaaad32e39e0 "auxiliary test code (environments or event listeners)", 
    method=<optimized out>, object=0xaaaafdf6da00) at ../../../test/gtest/common/gtest-all.cc:3598
#36 testing::UnitTest::Run (this=<optimized out>) at ../../../test/gtest/common/gtest-all.cc:5419
#37 0x0000aaaad2ce2cf4 in RUN_ALL_TESTS () at ../../../test/gtest/common/gtest.h:20059
#38 main (argc=<optimized out>, argv=<optimized out>) at ../../../test/gtest/common/main.cc:109
yosefe commented 3 years ago

@shamisp pls fill system details according to issue template, in particular MLNX_OFED and FW versions

shamisp commented 3 years ago

@yosefe it is DOCA_v1.0_BlueField_OS_Ubuntu_20.04-5.3-1.0.0.0-3.6.0.11699-1-aarch64 - this is Mellanox OS ID string that is used to identify bundle of MOFED+FW+BIOS (pretty much everything on BF2).

shamisp commented 3 years ago

This DOCA image bundles MLNX_OFED_LINUX-5.3-1.0.0.0 (OFED-5.3-1.0.0) + 24.30.1004. The setup is identified through cat /etc/mlnx-release. Apparently, you cannot just update MOFED on DOCA image. At least I was strongly advised not do it.

shamisp commented 3 years ago

If I abort gtest post failure, the dmesg is full of:

[52096.327716] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[0]: DESTROY_MKEY(0x202) No done completion
[52096.327725] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_MKEY(0x202) timeout. Will cause a leak of a command resource
[52161.852362] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[1]: DESTROY_QP(0x501) No done completion
[52161.852372] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[52227.377069] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[2]: DESTROY_QP(0x501) No done completion
[52227.377079] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[52292.901784] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[3]: DESTROY_QP(0x501) No done completion
[52292.901793] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[52358.430405] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[4]: DESTROY_MKEY(0x202) No done completion
[52358.430415] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_MKEY(0x202) timeout. Will cause a leak of a command resource
[52423.951095] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[5]: DESTROY_QP(0x501) No done completion
[52423.951105] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[52489.475777] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[6]: DESTROY_QP(0x501) No done completion
[52489.475787] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource
[52555.000453] mlx5_core 0000:03:00.0: wait_func_handle_exec_timeout:1099:(pid 28382): cmd[7]: DESTROY_QP(0x501) No done completion
[52555.000462] mlx5_core 0000:03:00.0: wait_func:1126:(pid 28382): DESTROY_QP(0x501) timeout. Will cause a leak of a command resource

The system becomes unstable post-failure and I had to reboot the nic.