openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.15k stars 427 forks source link

failure on test_ucp_sockaddr_destroy_ep_on_err.onesided tests #5337

Open alinask opened 4 years ago

alinask commented 4 years ago

First occurence taken from https://github.com/openucx/ucx/issues/5192#issuecomment-633529325 :

Can be related (http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/5059/nodes/121/steps/1266/log/?start=0)

[2020-05-25T08:55:58.408Z] [ RUN      ] dc_ud/test_ucp_sockaddr.onesided_disconnect_c2s/3 <dc_x,ud_v,ud_x,mm>
[2020-05-25T08:55:58.692Z] [     INFO ] Testing 65.65.65.12:0
[2020-05-25T08:55:58.692Z] [     INFO ] server listening on 65.65.65.12:55375
[2020-05-25T08:55:58.692Z] [     INFO ] ignoring error Connection reset by remote peer on endpoint 0x3fff8e880000
[2020-05-25T08:56:08.746Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:605: Failure
[2020-05-25T08:56:08.746Z] Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
[2020-05-25T08:56:08.746Z] free not completed EP close request
[2020-05-25T09:01:00.532Z] [  FAILED  ] dc_ud/test_ucp_sockaddr.onesided_disconnect_c2s/3, where GetParam() = dc_x,ud_v,ud_x,mm (296134 ms)

Second one from https://github.com/openucx/ucx/pull/5327#issuecomment-649714574 :

[ RUN      ] shm_ib/test_ucp_sockaddr_destroy_ep_on_err.onesided_c2s_sforce/8 <shm,ib>
[     INFO ] Testing fe80::268a:7ff:fe8d:af27:0
[     INFO ] server listening on fe80::268a:7ff:fe8d:af27:58404
/scrap/azure/agent-02/AZP_WORKSPACE/2/s/contrib/../test/gtest/ucp/ucp_test.cc:599: Failure
Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
free not completed EP close request
[  FAILED  ] shm_ib/test_ucp_sockaddr_destroy_ep_on_err.onesided_c2s_sforce/8, where GetParam() = shm,ib (14399 ms)

https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=7203&view=logs&j=ed184461-b056-56e5-1bb9-ebb6cdb98231&t=1ea5f585-9512-55e8-ca53-51629b0a85a7

alinask commented 4 years ago

Could be related to the lack of 'ep_timeout' mechanism on the client after the server closed its ep with the FORCE mode.

evgeny-leksikov commented 4 years ago

another failure on PPC http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/7529/nodes/120/steps/910/log/?start=0

[2020-10-15T08:57:18.011Z] [ RUN      ] rcx/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/3 <rc_x>
[2020-10-15T08:57:18.011Z] [     INFO ] Testing 65.65.65.15:0
[2020-10-15T08:57:18.011Z] [     INFO ] server listening on 65.65.65.15:53807
[2020-10-15T08:57:28.047Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:619: Failure
[2020-10-15T08:57:28.047Z] Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
[2020-10-15T08:57:28.047Z] free not completed EP close request
[2020-10-15T09:12:35.426Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/test_helpers.cc:53: Failure
[2020-10-15T09:12:35.426Z] Failed
[2020-10-15T09:12:35.426Z] Connection timed out - abort testing
[2020-10-15T09:12:35.426Z] [r-vmb-ppc-jenkins:10204:0:10204] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
[2020-10-15T09:12:35.426Z] 
[2020-10-15T09:12:35.426Z] /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc: [ ucp_test::progress() ]
[2020-10-15T09:12:35.426Z]       ...
[2020-10-15T09:12:35.426Z]       144          iter != entities().end(); ++iter)
[2020-10-15T09:12:35.426Z]       145     {
[2020-10-15T09:12:35.426Z]       146         count += (*iter)->progress(worker_index);
[2020-10-15T09:12:35.426Z] ==>   147         sched_yield();
[2020-10-15T09:12:35.426Z]       148     }
[2020-10-15T09:12:35.426Z]       149     return count;
[2020-10-15T09:12:35.426Z]       150 }
[2020-10-15T09:12:35.426Z] 
[2020-10-15T09:12:35.426Z] ==== backtrace (tid:  10204) ====
[2020-10-15T09:12:35.426Z]  0 0x000000000005ece8 ucs_debug_print_backtrace()  /scrap/jenkins/workspace/ucx-5/contrib/../src/ucs/debug/debug.c:656
[2020-10-15T09:12:35.426Z]  1 0x00000000106099a0 ucp_test::progress()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:147
[2020-10-15T09:12:35.426Z]  2 0x000000001060a12c ucp_test_base::entity::close_all_eps()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:640
[2020-10-15T09:12:35.426Z]  3 0x000000001060a2c4 ucp_test::disconnect()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:189
[2020-10-15T09:12:35.426Z]  4 0x000000001060a2c4 ucp_test_base::entity::get_num_workers()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:768
[2020-10-15T09:12:35.426Z]  5 0x000000001060a2c4 ucp_test::disconnect()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:179
[2020-10-15T09:12:35.426Z]  6 0x000000001060a370 ucp_test::cleanup()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.cc:63
[2020-10-15T09:12:35.426Z]  7 0x00000000102070d0 ucs::test_base::TearDownProxy()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/test.cc:315
[2020-10-15T09:12:35.426Z]  8 0x00000000103e93d0 ucp_test::TearDown()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/ucp/ucp_test.h:174
[2020-10-15T09:12:35.426Z]  9 0x00000000101e2abc HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-10-15T09:12:35.426Z] 10 0x00000000101e2abc testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-10-15T09:12:35.426Z] 11 0x00000000101d1be4 testing::Test::Run()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3643
[2020-10-15T09:12:35.426Z] 12 0x00000000101d1dac testing::TestInfo::Run()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3812
[2020-10-15T09:12:35.426Z] 13 0x00000000101d1ff8 testing::TestCase::Run()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3930
[2020-10-15T09:12:35.426Z] 14 0x00000000101d7ff8 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:5808
[2020-10-15T09:12:35.426Z] 15 0x00000000101d8444 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:5725
[2020-10-15T09:12:35.426Z] 16 0x00000000101d8444 HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-10-15T09:12:35.426Z] 17 0x00000000101d8444 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-10-15T09:12:35.426Z] 18 0x00000000101d8444 testing::UnitTest::Run()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest-all.cc:5422
[2020-10-15T09:12:35.426Z] 19 0x0000000010154688 RUN_ALL_TESTS()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/gtest.h:20059
[2020-10-15T09:12:35.426Z] 20 0x0000000010154688 main()  /scrap/jenkins/workspace/ucx-5/contrib/../test/gtest/common/main.cc:102
[2020-10-15T09:12:35.426Z] 21 0x0000000000025100 generic_start_main.isra.0()  libc-start.c:0
[2020-10-15T09:12:35.426Z] 22 0x00000000000252f4 __libc_start_main()  ???:0
[2020-10-15T09:12:35.426Z] =================================

I see there is a difference in output:

[     INFO ] ignoring error Connection reset by remote peer on endpoint

this line presence points that wireup uses CM based mechanism

dmitrygx commented 3 years ago
[2020-11-10T12:05:41.726Z] [ RUN      ] dc_ud/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/6 <dc_x,ud_v,ud_x,mm/stream,cm>
[2020-11-10T12:05:41.983Z] [     INFO ] Testing 65.65.65.12:0
[2020-11-10T12:05:41.983Z] [     INFO ] server listening on 65.65.65.12:34269
[2020-11-10T12:05:52.040Z] /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:700: Failure
[2020-11-10T12:05:52.040Z] Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
[2020-11-10T12:05:52.040Z] free not completed EP close request
[2020-11-10T12:06:59.749Z] [1605010011.787471] [r-vmb-ppc-jenkins:25210:0]      ucp_test.cc:226  UCX  ERROR request 0x100142cbea8 did not complete on time
[2020-11-10T12:20:52.156Z] /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/test_helpers.cc:53: Failure
[2020-11-10T12:20:52.156Z] Failed
[2020-11-10T12:20:52.156Z] Connection timed out - abort testing
[2020-11-10T12:20:53.524Z] [r-vmb-ppc-jenkins:25210:0:25210] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
[2020-11-10T12:20:53.781Z] 
[2020-11-10T12:20:53.781Z] /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc: [ ucp_test::progress() ]
[2020-11-10T12:20:53.781Z]       ...
[2020-11-10T12:20:53.781Z]       152          iter != entities().end(); ++iter)
[2020-11-10T12:20:53.781Z]       153     {
[2020-11-10T12:20:53.781Z]       154         count += (*iter)->progress(worker_index);
[2020-11-10T12:20:53.781Z] ==>   155         sched_yield();
[2020-11-10T12:20:53.781Z]       156     }
[2020-11-10T12:20:53.781Z]       157     return count;
[2020-11-10T12:20:53.781Z]       158 }
[2020-11-10T12:20:53.781Z] 
[2020-11-10T12:20:54.343Z] ==== backtrace (tid:  25210) ====
[2020-11-10T12:20:54.343Z]  0 0x000000000005ed48 ucs_debug_print_backtrace()  /scrap/jenkins/workspace/ucx-6/contrib/../src/ucs/debug/debug.c:656
[2020-11-10T12:20:54.343Z]  1 0x000000001060bc70 ucp_test::progress()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:155
[2020-11-10T12:20:54.343Z]  2 0x000000001060c46c ucp_test_base::entity::close_all_eps()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:721
[2020-11-10T12:20:54.343Z]  3 0x000000001060c604 ucp_test::disconnect()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:197
[2020-11-10T12:20:54.343Z]  4 0x000000001060c604 ucp_test_base::entity::get_num_workers()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:849
[2020-11-10T12:20:54.343Z]  5 0x000000001060c604 ucp_test::disconnect()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:187
[2020-11-10T12:20:54.343Z]  6 0x000000001060c6b0 ucp_test::cleanup()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.cc:78
[2020-11-10T12:20:54.343Z]  7 0x000000001020a570 ucs::test_base::TearDownProxy()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/test.cc:315
[2020-11-10T12:20:54.343Z]  8 0x00000000103eab90 ucp_test::TearDown()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/ucp/ucp_test.h:190
[2020-11-10T12:20:54.343Z]  9 0x00000000101e5f5c HandleSehExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-11-10T12:20:54.343Z] 10 0x00000000101e5f5c testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-11-10T12:20:54.343Z] 11 0x00000000101d5084 testing::Test::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3643
[2020-11-10T12:20:54.343Z] 12 0x00000000101d524c testing::TestInfo::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3812
[2020-11-10T12:20:54.343Z] 13 0x00000000101d5498 testing::TestCase::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3930
[2020-11-10T12:20:54.343Z] 14 0x00000000101db498 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5808
[2020-11-10T12:20:54.343Z] 15 0x00000000101db8e4 testing::internal::UnitTestImpl::RunAllTests()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5725
[2020-11-10T12:20:54.343Z] 16 0x00000000101db8e4 HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3562
[2020-11-10T12:20:54.343Z] 17 0x00000000101db8e4 HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:3598
[2020-11-10T12:20:54.343Z] 18 0x00000000101db8e4 testing::UnitTest::Run()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest-all.cc:5422
[2020-11-10T12:20:54.343Z] 19 0x0000000010158298 RUN_ALL_TESTS()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/gtest.h:20059
[2020-11-10T12:20:54.343Z] 20 0x0000000010158298 main()  /scrap/jenkins/workspace/ucx-6/contrib/../test/gtest/common/main.cc:102
[2020-11-10T12:20:54.343Z] 21 0x0000000000025100 generic_start_main.isra.0()  libc-start.c:0
[2020-11-10T12:20:54.343Z] 22 0x00000000000252f4 __libc_start_main()  ???:0
[2020-11-10T12:20:54.343Z] =================================

on r-vmb-ppc-jenkins node (pipeline-jenkins CI) http://hpc-master.lab.mtl.com:8080/blue/organizations/jenkins/ucx/detail/ucx/7924/pipeline/660 http://hpc-master.lab.mtl.com:8080/blue/rest/organizations/jenkins/pipelines/ucx/runs/7924/nodes/660/steps/668/log/?start=0

dmitrygx commented 3 years ago
2020-11-30T21:21:51.7231499Z [ RUN      ] udx/test_ucp_sockaddr_destroy_ep_on_err.onesided_client_sforce/5 <ud_x/tag,cm,mt>
2020-11-30T21:21:51.7761004Z [     INFO ] Testing fe80::ee0d:9a03:8c:c90f:0
2020-11-30T21:21:51.7766050Z [     INFO ] server listening on fe80::ee0d:9a03:8c:c90f:60603
2020-11-30T21:22:01.7866224Z /scrap/azure/agent-15/AZP_WORKSPACE/1/s/contrib/../test/gtest/ucp/ucp_test.cc:700: Failure
2020-11-30T21:22:01.7867615Z Expected: (UCS_INPROGRESS) != (status), actual: 1 vs 1
2020-11-30T21:22:01.7868566Z free not completed EP close request: 0xaaaae281d128
2020-11-30T21:23:01.7877058Z [1606771381.787254] [swx-rdmz-ucx-arm-hwi:45330:0]      ucp_test.cc:226  UCX  ERROR request 0xaaaae281cfe8 did not complete on time
2020-11-30T21:37:01.7869067Z /scrap/azure/agent-15/AZP_WORKSPACE/1/s/contrib/../test/gtest/common/test_helpers.cc:53: Failure
2020-11-30T21:37:01.7871160Z Failed
2020-11-30T21:37:01.7875173Z Connection timed out - abort testing
2020-11-30T21:37:01.7876872Z [swx-rdmz-ucx-arm-hwi:45330:0:45330] Caught signal 6 (Aborted: tkill(2) or tgkill(2))
2020-11-30T21:37:01.7885490Z ==== backtrace (tid:  45330) ====
2020-11-30T21:37:01.7887344Z  0  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(ucs_handle_error+0x10c) [0xffffbe23873c]
2020-11-30T21:37:01.7889388Z  1  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x2da78) [0xffffbe238a78]
2020-11-30T21:37:01.7891243Z  2  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/src/ucs/.libs/libucs.so.0(+0x2db90) [0xffffbe238b90]
2020-11-30T21:37:01.7892802Z  3  linux-vdso.so.1(__kernel_rt_sigreturn+0) [0xffffbe2bd688]
2020-11-30T21:37:01.7894144Z  4  /lib/aarch64-linux-gnu/libc.so.6(sched_yield+0x8) [0xffffbdc1d348]
2020-11-30T21:37:01.7895596Z  5  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x610b20) [0xaaaad2e39b20]
2020-11-30T21:37:01.7897089Z  6  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x6112cc) [0xaaaad2e3a2cc]
2020-11-30T21:37:01.7898584Z  7  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x61139c) [0xaaaad2e3a39c]
2020-11-30T21:37:01.7900109Z  8  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2eb224) [0xaaaad2b14224]
2020-11-30T21:37:01.7901585Z  9  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2cde64) [0xaaaad2af6e64]
2020-11-30T21:37:01.7903078Z 10  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c57d0) [0xaaaad2aee7d0]
2020-11-30T21:37:01.7904595Z 11  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c5ab0) [0xaaaad2aeeab0]
2020-11-30T21:37:01.7906321Z 12  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c6844) [0xaaaad2aef844]
2020-11-30T21:37:01.7907852Z 13  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2c6b2c) [0xaaaad2aefb2c]
2020-11-30T21:37:01.7909332Z 14  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x25fa70) [0xaaaad2a88a70]
2020-11-30T21:37:01.7910770Z 15  /lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0xe0) [0xffffbdb836e0]
2020-11-30T21:37:01.7912204Z 16  /scrap/azure/agent-15/AZP_WORKSPACE/1/s/build-test/test/gtest/gtest(+0x2aa5bc) [0xaaaad2ad35bc]
2020-11-30T21:37:01.7913058Z =================================
2020-11-30T21:37:01.7914190Z [swx-rdmz-ucx-arm-hwi:45330:0:45330] Process frozen...

Link to job: https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=11386&view=logs&j=91485d7f-07f8-5dc4-be43-59794e62e572&t=9aca67d2-2f0a-5509-6a19-ed5514a50604 Link to log: https://dev.azure.com/ucfconsort/0b36e3f0-8ab9-4a48-b68b-4b2350e02c88/_apis/build/builds/11386/logs/155