openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

IO demo failed on sputnik1/sputnik2: completion with error #6377

Open dmitrygx opened 3 years ago

dmitrygx commented 3 years ago

sputnik1: [1613767254.972919] [UCX] conn_id send request 0x231eec0 failed: Connection reset by remote peer
sputnik1: [1613767254.972928] [UCX-connection #1 2.1.4.2:58240] failed to send remote connection id
sputnik1: [1613767254.972933] [UCX-connection #1 2.1.4.2:58240] closing ep 0x7f371f975000 mode force
sputnik1: [1613767254.976248] [UCX] conn_id receive request 0x231f000 failed: Request canceled
sputnik1: [1613767254.976275] [UCX-connection #1 2.1.4.2:58240] destroying, ep is 0
sputnik1: [1613767254.976293] [UCX-connection #1 2.1.4.2:58240] released
sputnik2: 
sputnik2: /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/mlx5/ib_mlx5_log.c: [ uct_ib_mlx5_completion_with_err() ]
sputnik2:       ...
sputnik2:       137     }
sputnik2:       138 
sputnik2:       139     ucs_log(log_level,
sputnik2: ==>   140             "%s on "UCT_IB_IFACE_FMT"/%s (synd 0x%x vend 0x%x hw_synd %d/%d)\n"
sputnik2:       141             "%s QP 0x%x wqe[%d]: %s",
sputnik2:       142             err_info, UCT_IB_IFACE_ARG(iface),
sputnik2:       143             uct_ib_iface_is_roce(iface) ? "RoCE" : "IB",
sputnik2: 
sputnik2: ==== backtrace (tid:  31800) ====
sputnik2:  0 0x0000000000054b95 ucs_debug_print_backtrace()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/ucs/debug/debug.c:656
sputnik2:  1 0x0000000000021c89 uct_ib_mlx5_completion_with_err()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/mlx5/ib_mlx5_log.c:140
sputnik2:  2 0x0000000000034dfa uct_rc_mlx5_iface_poll_rx_cq()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5.inl:298
sputnik2:  3 0x0000000000034dfa uct_rc_mlx5_iface_common_poll_rx()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5.inl:1443
sputnik2:  4 0x0000000000034dfa uct_rc_mlx5_iface_progress()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5_iface.c:143
sputnik2:  5 0x0000000000034dfa uct_rc_mlx5_iface_progress_cyclic()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/ib/rc/accel/rc_mlx5_iface.c:153
sputnik2:  6 0x000000000002f3f2 ucs_callbackq_dispatch()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/ucs/datastruct/callbackq.h:211
sputnik2:  7 0x000000000002f3f2 uct_worker_progress()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/uct/api/uct.h:2436
sputnik2:  8 0x000000000002f3f2 ucp_worker_progress()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/src/ucp/core/ucp_worker.c:2430
sputnik2:  9 0x000000000040656a UcxContext::wait_completion()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/ucx_wrapper.cc:378
sputnik2: 10 0x0000000000408ab4 UcxConnection::connect_common()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/ucx_wrapper.cc:802
sputnik2: 11 0x000000000040a7fe UcxConnection::connect()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/ucx_wrapper.cc:572
sputnik2: 12 0x0000000000412c12 DemoClient::connect()  /auto/rdmzsysgwork/swx-azure-svc/workspace/azure/io_demo_sputnik/1/s/test/apps/iodemo/io_demo.cc:1252

https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=13829&view=logs&j=7b3f24ed-c9de-5cb6-fc71-0c2d50562947&t=7d1a39c0-d225-5e87-9bac-6f47001ee32a

weiguangcui commented 3 years ago

I see the same error with UCX 1.9.0. Is there a simple fix? Thanks.