Open dmitrygx opened 3 years ago
sputnik1: [1613767254.972919] [UCX] conn_id send request 0x231eec0 failed: Connection reset by remote peer
sputnik1: [1613767254.972928] [UCX-connection #1 2.1.4.2:58240] failed to send remote connection id
sputnik1: [1613767254.972933] [UCX-connection #1 2.1.4.2:58240] closing ep 0x7f371f975000 mode force
sputnik1: [1613767254.976248] [UCX] conn_id receive request 0x231f000 failed: Request canceled
sputnik1: [1613767254.976275] [UCX-connection #1 2.1.4.2:58240] destroying, ep is 0
sputnik1: [1613767254.976293] [UCX-connection #1 2.1.4.2:58240] released
sputnik2:
sputnik2: /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/mlx5/ib_mlx5_log.c: [ uct_ib_mlx5_completion_with_err() ]
sputnik2: ...
sputnik2: 137 }
sputnik2: 138
sputnik2: 139 ucs_log(log_level,
sputnik2: ==> 140 "%s on "UCT_IB_IFACE_FMT"/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n"
sputnik2: 141 "%s QP 0x%x wqe[%d]: %s",
sputnik2: 142 err_info, UCT_IB_IFACE_ARG(iface),
sputnik2: 143 uct_ib_iface_is_roce(iface) ? "RoCE" : "IB",
sputnik2:
sputnik2: ==== backtrace (tid: 31800) ====
sputnik2: 0 0x0000000000054b95 ucs_debug_print_backtrace() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/ucs/debug/debug.c:656
sputnik2: 1 0x0000000000021c89 uct_ib_mlx5_completion_with_err() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/mlx5/ib_mlx5_log.c:140
sputnik2: 2 0x0000000000034dfa uct_rc_mlx5_iface_poll_rx_cq() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5.inl:298
sputnik2: 3 0x0000000000034dfa uct_rc_mlx5_iface_common_poll_rx() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5.inl:1443
sputnik2: 4 0x0000000000034dfa uct_rc_mlx5_iface_progress() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5_iface.c:143
sputnik2: 5 0x0000000000034dfa uct_rc_mlx5_iface_progress_cyclic() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5_iface.c:153
sputnik2: 6 0x000000000002f3f2 ucs_callbackq_dispatch() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/ucs/datastruct/callbackq.h:211
sputnik2: 7 0x000000000002f3f2 uct_worker_progress() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/api/uct.h:2436
sputnik2: 8 0x000000000002f3f2 ucp_worker_progress() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/ucp/core/ucp_worker.c:2430
sputnik2: 9 0x000000000040656a UcxContext::wait_completion() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/ucx_wrapper.cc:378
sputnik2: 10 0x0000000000408ab4 UcxConnection::connect_common() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/ucx_wrapper.cc:802
sputnik2: 11 0x000000000040a7fe UcxConnection::connect() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/ucx_wrapper.cc:572
sputnik2: 12 0x0000000000412c12 DemoClient::connect() /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/io_demo.cc:1252
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=13829&view=logs&j=7b3f24ed-c9de-5cb6-fc71-0c2d50562947&t=7d1a39c0-d225-5e87-9bac-6f47001ee32a
I see the same error with UCX 1.9.0. Is there a simple fix? Thanks.