Open alinask opened 4 years ago
Could be related to the lack of 'ep_timeout' mechanism on the client after the server closed its ep with the FORCE mode.
another failure on PPC http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/7529/nodes/120/steps/910/log/?start=0
[2020-10-15T08:57:18.011Z] [ RUN ] rcx/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/3 <rc_x>
[2020-10-15T08:57:18.011Z] [ INFO ] Testing 65.65.65.15:0
[2020-10-15T08:57:18.011Z] [ INFO ] server listening on 65.65.65.15:53807
[2020-10-15T08:57:28.047Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:619: Failure
[2020-10-15T08:57:28.047Z] Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
[2020-10-15T08:57:28.047Z] free not completed EP close request
[2020-10-15T09:12:35.426Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/test_helpers.cc:53: Failure
[2020-10-15T09:12:35.426Z] Failed
[2020-10-15T09:12:35.426Z] Connection timed out - abort testing
[2020-10-15T09:12:35.426Z] [r-vmb-ppc-jenkins:10204:0:10204] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
[2020-10-15T09:12:35.426Z]
[2020-10-15T09:12:35.426Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc: [ ucp_test::progress() ]
[2020-10-15T09:12:35.426Z] ...
[2020-10-15T09:12:35.426Z] 144 iter != entities().end(); ++iter)
[2020-10-15T09:12:35.426Z] 145 {
[2020-10-15T09:12:35.426Z] 146 count += (*iter)->progress(worker_index);
[2020-10-15T09:12:35.426Z] ==> 147 sched_yield();
[2020-10-15T09:12:35.426Z] 148 }
[2020-10-15T09:12:35.426Z] 149 return count;
[2020-10-15T09:12:35.426Z] 150 }
[2020-10-15T09:12:35.426Z]
[2020-10-15T09:12:35.426Z] ==== backtrace (tid: 10204) ====
[2020-10-15T09:12:35.426Z] 0 0x000000000005ece8 ucs_debug_print_backtrace() /scrap/jenkins/workspace/ucx-5/contrib/../src/ucs/debug/debug.c:656
[2020-10-15T09:12:35.426Z] 1 0x00000000106099a0 ucp_test::progress() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:147
[2020-10-15T09:12:35.426Z] 2 0x000000001060a12c ucp_test_base::entity::close_all_eps() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:640
[2020-10-15T09:12:35.426Z] 3 0x000000001060a2c4 ucp_test::disconnect() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:189
[2020-10-15T09:12:35.426Z] 4 0x000000001060a2c4 ucp_test_base::entity::get_num_workers() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:768
[2020-10-15T09:12:35.426Z] 5 0x000000001060a2c4 ucp_test::disconnect() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:179
[2020-10-15T09:12:35.426Z] 6 0x000000001060a370 ucp_test::cleanup() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:63
[2020-10-15T09:12:35.426Z] 7 0x00000000102070d0 ucs::test_base::TearDownProxy() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/test.cc:315
[2020-10-15T09:12:35.426Z] 8 0x00000000103e93d0 ucp_test::TearDown() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.h:174
[2020-10-15T09:12:35.426Z] 9 0x00000000101e2abc HandleSehExceptionsInMethodIfSupported<testing::Test, void>() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-10-15T09:12:35.426Z] 10 0x00000000101e2abc testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-10-15T09:12:35.426Z] 11 0x00000000101d1be4 testing::Test::Run() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3643
[2020-10-15T09:12:35.426Z] 12 0x00000000101d1dac testing::TestInfo::Run() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3812
[2020-10-15T09:12:35.426Z] 13 0x00000000101d1ff8 testing::TestCase::Run() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3930
[2020-10-15T09:12:35.426Z] 14 0x00000000101d7ff8 testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:5808
[2020-10-15T09:12:35.426Z] 15 0x00000000101d8444 testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:5725
[2020-10-15T09:12:35.426Z] 16 0x00000000101d8444 HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-10-15T09:12:35.426Z] 17 0x00000000101d8444 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-10-15T09:12:35.426Z] 18 0x00000000101d8444 testing::UnitTest::Run() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:5422
[2020-10-15T09:12:35.426Z] 19 0x0000000010154688 RUN_ALL_TESTS() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest.h:20059
[2020-10-15T09:12:35.426Z] 20 0x0000000010154688 main() /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/main.cc:102
[2020-10-15T09:12:35.426Z] 21 0x0000000000025100 generic_start_main.isra.0() libc-start.c:0
[2020-10-15T09:12:35.426Z] 22 0x00000000000252f4 __libc_start_main() ???:0
[2020-10-15T09:12:35.426Z] =================================
I see there is a difference in output:
[ INFO ] ignoring error Connection reset by remote peer on endpoint
this line presence points that wireup uses CM based mechanism
[2020-11-10T12:05:41.726Z] [ RUN ] dc_ud/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/6 <dc_x,ud_v,ud_x,mm/stream,cm>
[2020-11-10T12:05:41.983Z] [ INFO ] Testing 65.65.65.12:0
[2020-11-10T12:05:41.983Z] [ INFO ] server listening on 65.65.65.12:34269
[2020-11-10T12:05:52.040Z] /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:700: Failure
[2020-11-10T12:05:52.040Z] Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
[2020-11-10T12:05:52.040Z] free not completed EP close request
[2020-11-10T12:06:59.749Z] [1605010011.787471] [r-vmb-ppc-jenkins:25210:0] ucp_test.cc:226 UCX ERROR request 0x100142cbea8 did not complete on time
[2020-11-10T12:20:52.156Z] /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/test_helpers.cc:53: Failure
[2020-11-10T12:20:52.156Z] Failed
[2020-11-10T12:20:52.156Z] Connection timed out - abort testing
[2020-11-10T12:20:53.524Z] [r-vmb-ppc-jenkins:25210:0:25210] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
[2020-11-10T12:20:53.781Z]
[2020-11-10T12:20:53.781Z] /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc: [ ucp_test::progress() ]
[2020-11-10T12:20:53.781Z] ...
[2020-11-10T12:20:53.781Z] 152 iter != entities().end(); ++iter)
[2020-11-10T12:20:53.781Z] 153 {
[2020-11-10T12:20:53.781Z] 154 count += (*iter)->progress(worker_index);
[2020-11-10T12:20:53.781Z] ==> 155 sched_yield();
[2020-11-10T12:20:53.781Z] 156 }
[2020-11-10T12:20:53.781Z] 157 return count;
[2020-11-10T12:20:53.781Z] 158 }
[2020-11-10T12:20:53.781Z]
[2020-11-10T12:20:54.343Z] ==== backtrace (tid: 25210) ====
[2020-11-10T12:20:54.343Z] 0 0x000000000005ed48 ucs_debug_print_backtrace() /scrap/jenkins/workspace/ucx-6/contrib/../src/ucs/debug/debug.c:656
[2020-11-10T12:20:54.343Z] 1 0x000000001060bc70 ucp_test::progress() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:155
[2020-11-10T12:20:54.343Z] 2 0x000000001060c46c ucp_test_base::entity::close_all_eps() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:721
[2020-11-10T12:20:54.343Z] 3 0x000000001060c604 ucp_test::disconnect() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:197
[2020-11-10T12:20:54.343Z] 4 0x000000001060c604 ucp_test_base::entity::get_num_workers() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:849
[2020-11-10T12:20:54.343Z] 5 0x000000001060c604 ucp_test::disconnect() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:187
[2020-11-10T12:20:54.343Z] 6 0x000000001060c6b0 ucp_test::cleanup() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:78
[2020-11-10T12:20:54.343Z] 7 0x000000001020a570 ucs::test_base::TearDownProxy() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/test.cc:315
[2020-11-10T12:20:54.343Z] 8 0x00000000103eab90 ucp_test::TearDown() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.h:190
[2020-11-10T12:20:54.343Z] 9 0x00000000101e5f5c HandleSehExceptionsInMethodIfSupported<testing::Test, void>() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-11-10T12:20:54.343Z] 10 0x00000000101e5f5c testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-11-10T12:20:54.343Z] 11 0x00000000101d5084 testing::Test::Run() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3643
[2020-11-10T12:20:54.343Z] 12 0x00000000101d524c testing::TestInfo::Run() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3812
[2020-11-10T12:20:54.343Z] 13 0x00000000101d5498 testing::TestCase::Run() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3930
[2020-11-10T12:20:54.343Z] 14 0x00000000101db498 testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5808
[2020-11-10T12:20:54.343Z] 15 0x00000000101db8e4 testing::internal::UnitTestImpl::RunAllTests() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5725
[2020-11-10T12:20:54.343Z] 16 0x00000000101db8e4 HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-11-10T12:20:54.343Z] 17 0x00000000101db8e4 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-11-10T12:20:54.343Z] 18 0x00000000101db8e4 testing::UnitTest::Run() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5422
[2020-11-10T12:20:54.343Z] 19 0x0000000010158298 RUN_ALL_TESTS() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest.h:20059
[2020-11-10T12:20:54.343Z] 20 0x0000000010158298 main() /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/main.cc:102
[2020-11-10T12:20:54.343Z] 21 0x0000000000025100 generic_start_main.isra.0() libc-start.c:0
[2020-11-10T12:20:54.343Z] 22 0x00000000000252f4 __libc_start_main() ???:0
[2020-11-10T12:20:54.343Z] =================================
on r-vmb-ppc-jenkins
node (pipeline-jenkins CI)
http://hpc-master.lab.mtl.com:8080/blue/organizations/jenkins/ucx/detail/ucx/7924/pipeline/660
http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/7924/nodes/660/steps/668/log/?start=0
2020-11-30T21:21:51.7231499Z [ RUN ] udx/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/5 <ud_x/tag,cm,mt>
2020-11-30T21:21:51.7761004Z [ INFO ] Testing fe80::ee0d:9a03:8c:c90f:0
2020-11-30T21:21:51.7766050Z [ INFO ] server listening on fe80::ee0d:9a03:8c:c90f:60603
2020-11-30T21:22:01.7866224Z /scrap/azure/agent-15/AZP_WORKSPACE/1/s/contrib/../test/gtest/ucp/ucp_test.cc:700: Failure
2020-11-30T21:22:01.7867615Z Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
2020-11-30T21:22:01.7868566Z free not completed EP close request: 0xaaaae281d128
2020-11-30T21:23:01.7877058Z [1606771381.787254] [swx-rdmz-ucx-arm-hwi:45330:0] ucp_test.cc:226 UCX ERROR request 0xaaaae281cfe8 did not complete on time
2020-11-30T21:37:01.7869067Z /scrap/azure/agent-15/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/test_helpers.cc:53: Failure
2020-11-30T21:37:01.7871160Z Failed
2020-11-30T21:37:01.7875173Z Connection timed out - abort testing
2020-11-30T21:37:01.7876872Z [swx-rdmz-ucx-arm-hwi:45330:0:45330] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
2020-11-30T21:37:01.7885490Z ==== backtrace (tid: 45330) ====
2020-11-30T21:37:01.7887344Z 0 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x10c) [0xffffbe23873c]
2020-11-30T21:37:01.7889388Z 1 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x2da78) [0xffffbe238a78]
2020-11-30T21:37:01.7891243Z 2 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x2db90) [0xffffbe238b90]
2020-11-30T21:37:01.7892802Z 3 linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffbe2bd688]
2020-11-30T21:37:01.7894144Z 4 /lib/aarch64-linux-gnu/libc.so.6(sched_yield+0x8) [0xffffbdc1d348]
2020-11-30T21:37:01.7895596Z 5 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x610b20) [0xaaaad2e39b20]
2020-11-30T21:37:01.7897089Z 6 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x6112cc) [0xaaaad2e3a2cc]
2020-11-30T21:37:01.7898584Z 7 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x61139c) [0xaaaad2e3a39c]
2020-11-30T21:37:01.7900109Z 8 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2eb224) [0xaaaad2b14224]
2020-11-30T21:37:01.7901585Z 9 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2cde64) [0xaaaad2af6e64]
2020-11-30T21:37:01.7903078Z 10 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c57d0) [0xaaaad2aee7d0]
2020-11-30T21:37:01.7904595Z 11 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c5ab0) [0xaaaad2aeeab0]
2020-11-30T21:37:01.7906321Z 12 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c6844) [0xaaaad2aef844]
2020-11-30T21:37:01.7907852Z 13 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c6b2c) [0xaaaad2aefb2c]
2020-11-30T21:37:01.7909332Z 14 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x25fa70) [0xaaaad2a88a70]
2020-11-30T21:37:01.7910770Z 15 /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xffffbdb836e0]
2020-11-30T21:37:01.7912204Z 16 /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2aa5bc) [0xaaaad2ad35bc]
2020-11-30T21:37:01.7913058Z =================================
2020-11-30T21:37:01.7914190Z [swx-rdmz-ucx-arm-hwi:45330:0:45330] Process frozen...
Link to job: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=11386&view=logs&j=91485d7f-07f8-5dc4-be43-59794e62e572&t=9aca67d2-2f0a-5509-6a19-ed5514a50604 Link to log: https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/11386/logs/155
First occurence taken from https://github.com/openucx/ucx/issues/5192#issuecomment-633529325 :
Can be related (http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/5059/nodes/121/steps/1266/log/?start=0)
Second one from https://github.com/openucx/ucx/pull/5327#issuecomment-649714574 :
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=7203&view=logs&j=ed184461-b056-56e5-1bb9-ebb6cdb98231&t=1ea5f585-9512-55e8-ca53-51629b0a85a7