openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

failure in uct_p2p_rma_test.madvise and cma/test_md.fork #5509

Open yosefe opened 4 years ago

yosefe commented 4 years ago

http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/6422/nodes/123/steps/1719/log/?start=0

[2020-07-31T23:45:32.123Z] [ RUN      ] rc_verbs/uct_p2p_rma_test.madvise/2 <rc_verbs/mlx5_1:1>
[2020-07-31T23:45:32.123Z] [     INFO ] Testing component: ib
[2020-07-31T23:45:32.123Z] /scrap/jenkins/workspace/ucx-3/contrib/../test/gtest/uct/test_p2p_rma.cc:147: Failure
[2020-07-31T23:45:32.123Z] Value of: system(cmd_str)
[2020-07-31T23:45:32.123Z]   Actual: 32512
[2020-07-31T23:45:32.123Z] Expected: 0
[2020-07-31T23:45:32.123Z] [  FAILED  ] rc_verbs/uct_p2p_rma_test.madvise/2, where GetParam() = rc_verbs/mlx5_1:1 (70 ms)
[2020-07-31T23:45:32.123Z] [ RUN      ] rc_verbs/uct_p2p_rma_test.get_bcopy/2 <rc_verbs/mlx5_1:1>
[2020-07-31T23:45:32.123Z] [     INFO ] Testing component: ib
...
[2020-08-01T00:04:56.720Z] [ RUN      ] cma/test_md.fork/0 <cma>
[2020-08-01T00:04:56.720Z] [r-vmb-ppc-jenkins:8738 :0:25812] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x10021ff80e8)
[2020-08-01T00:04:56.976Z] /scrap/jenkins/workspace/ucx-3/contrib/../test/gtest/uct/test_md.cc:523: Failure
[2020-08-01T00:04:56.976Z] Value of: WIFEXITED(thread_status)
[2020-08-01T00:04:56.976Z]   Actual: false
[2020-08-01T00:04:56.976Z] Expected: true
[2020-08-01T00:04:56.976Z] [  FAILED  ] cma/test_md.fork/0, where GetParam() = cma (250 ms)
...
[2020-08-01T00:10:00.243Z] Skipped tests: count - 1214, time - 243619 ms
[2020-08-01T00:10:00.243Z] make: *** [test] Error 1
[2020-08-01T00:10:00.243Z] make: Leaving directory `/scrap/jenkins/workspace/ucx-3/build-test/test/gtest'
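
A note on the `Actual: 32512` value in the `uct_p2p_rma_test.madvise` failure above: `system()` returns a raw wait status, not an exit code. Decoding it gives exit code 127, which a POSIX shell conventionally reports when a command is not found (and which `system()` also reports when the shell itself could not be executed). The log does not show the contents of `cmd_str`, so this output alone does not say which command was missing. A minimal sketch of the decoding, written in Python for brevity; the helper name `decode_wait_status` is hypothetical and not part of UCX:

```python
import os


def decode_wait_status(status):
    """Split a raw wait status (as returned by system()/waitpid()) into its parts."""
    if os.WIFEXITED(status):
        # Normal termination: the exit code lives in the high byte.
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # Terminated by a signal: the signal number lives in the low bits.
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)


if __name__ == "__main__":
    # 32512 == 127 << 8, the value reported by the failed assertion.
    print(decode_wait_status(32512))  # ('exited', 127)
```

If the decoded code is 127, a reasonable first check is whether the command invoked by the test exists on the failing CI node.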

alinask commented 4 years ago

Reproduced with xpmem as well:

[ RUN      ] xpmem/test_md.fork/0 <xpmem>
/scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test_helpers.cc:49: Failure
Failed
Connection timed out - abort testing
[swx-rdmz-ucx-legacy-02:44809:0:44809] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
==== backtrace (tid:  44809) ====
 0 0x000000000005b948 ucs_debug_print_backtrace()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../src/ucs/debug/debug.c:656
 1 0x000000000000f0e4 __libc_wait()  :0
 2 0x000000000062dcfb test_md_fork_Test::test_body()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/uct/test_md.cc:522
 3 0x000000000062dcfb test_md_fork_Test::test_body()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/uct/test_md.cc:522
 4 0x00000000005b8096 ucs::test_base::run()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test.cc:298
 5 0x00000000005b8096 ucs::test_base::TestBodyProxy()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/test.cc:324
 6 0x000000000059b1c9 HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3562
 7 0x000000000058f40d testing::Test::Run()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3635
 8 0x000000000058f4e5 testing::TestInfo::Run()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3812
 9 0x000000000058f64f testing::TestCase::Run()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:3930
10 0x0000000000593dfd testing::internal::UnitTestImpl::RunAllTests()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:5808
11 0x00000000005940e0 testing::internal::UnitTestImpl::RunAllTests()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest-all.cc:5725
12 0x00000000005304a8 RUN_ALL_TESTS()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/gtest.h:20059
13 0x00000000005304a8 main()  /scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/common/main.cc:102
14 0x00000000000223d5 __libc_start_main()  ???:0
15 0x000000000057977f _start()  ???:0
=================================
[swx-rdmz-ucx-legacy-02:44809:0:44809] Process frozen...

alinask commented 4 years ago

Also reproduced with IB:

[ RUN      ] ib/test_md.fork/2 <mlx5_2>
[swx-rdmz-ucx-new-02:19051:0:13479] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x41612e8)
/scrap/azure/agent-07/AZP_WORKSPACE/2/s/contrib/../test/gtest/uct/test_md.cc:523: Failure
Value of: WIFEXITED(thread_status)
  Actual: false
Expected: true
[  FAILED  ] ib/test_md.fork/2, where GetParam() = mlx5_2 (39 ms)
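
In the `cma` and `ib` runs, the `WIFEXITED(thread_status)` assertion fails right after a `Caught signal 11` message. That is consistent with the forked child being terminated by SIGSEGV rather than exiting normally, although the test source is not shown here. The xpmem run differs: it aborts on a timeout. The sketch below is not the UCX test. It is a hypothetical Python illustration (the function name `run_child_and_report` is made up) of how a wait status looks when a child dies from a signal:

```python
import os
import resource
import signal


def run_child_and_report(sig):
    """Fork a child that kills itself with `sig`; return (exited, signaled, termsig)."""
    pid = os.fork()
    if pid == 0:
        # Child: disable core dumps, restore the default action, then raise the signal.
        resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
        signal.signal(sig, signal.SIG_DFL)
        os.kill(os.getpid(), sig)
        os._exit(0)  # Not reached when the signal terminates the process.
    # Parent: collect the raw wait status and decode it.
    _, status = os.waitpid(pid, 0)
    exited = os.WIFEXITED(status)
    signaled = os.WIFSIGNALED(status)
    termsig = os.WTERMSIG(status) if signaled else None
    return exited, signaled, termsig


if __name__ == "__main__":
    exited, signaled, termsig = run_child_and_report(signal.SIGSEGV)
    print("WIFEXITED:", exited, "WIFSIGNALED:", signaled, "WTERMSIG:", termsig)
```

A test that asserts only `WIFEXITED` reports the crash as a plain `false`. Also logging `WTERMSIG` when `WIFSIGNALED` is true would put the signal number in the failure message.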