Open kcgthb opened 4 years ago
Can you try adding UCX_RC_PATH_MTU=1024 ? Currently, UCX does not support different MTU values.
Hi @yosefe Thanks a lot for the quick response!
Unfortunately, setting UCX_RC_PATH_MTU=1024
still fails, albeit with different errors:
with UCX_RC_PATH_MTU=1024
and UCX_TLS=all
:
ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory cma/memory);
ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory);
ucp_worker.c:1543 UCX INFO ep_cfg[3]: tag(dc_mlx5/mlx5_0:1);
uct_iface.c:67 UCX WARN got active message id 0, but no handler installed
uct_iface.c:70 UCX WARN payload 64 of 68 bytes:
uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
log.c:349 UCX WARN ==== backtrace (tid: 15431) ====
log.c:352 UCX WARN 0 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_print_backtrace+0x27) [0x7fbe187ea687]
log.c:352 UCX WARN 1 /share/software/user/open/ucx/1.8.1/lib/libuct.so.0(+0x104e2) [0x7fbe18c3e4e2]
log.c:352 UCX WARN 2 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x5d891) [0x7fbe0bba5891]
log.c:352 UCX WARN 3 /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fbe1913f50a]
and then the process hangs
with UCX_RC_PATH_MTU=1024
and UCX_TLS=rc
:
ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1);
[sh03-01n03:5611 :0:5611] ib_mlx5_log.c:139 Local protection on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/4)
[sh03-01n03:5611 :0:5611] ib_mlx5_log.c:139 RC QP 0x11d95 wqe[0]: SEND --e [va 0x7f00e55f9500 len 34 lkey 0xe4d99]
==== backtrace (tid: 5611) ====
0 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x1ac) [0x7f00edc0a00c]
1 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x7f00edc06f30]
2 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(+0x28ef8) [0x7f00edc0aef8]
3 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7f00edc0b02c]
4 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x2a9) [0x7f00ed135ff9]
5 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x3d207) [0x7f00ed153207]
6 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x709) [0x7f00ed154f69]
7 /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f00ee56050a]
I played with different values of UCX_RC_PATH_MTU
, but they all produce the same result.
What does " Local protection" mean?
Thanks!
@kcgthb it seems there is some sort of data corruption / access to invalid memory address
Right: one of the host logged plenty of those in dmesg
:
[72970.271488] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7726): reg umr failed (5)
[72970.271504] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7719): reg umr failed (5)
[72970.271523] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7705): reg umr failed (5)
[72970.271532] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7715): reg umr failed (5)
[72970.271535] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7711): reg umr failed (5)
[72970.271553] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7717): reg umr failed (5)
[72970.272232] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7707): reg umr failed (5)
[72970.272719] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7709): reg umr failed (5)
After a reboot, the reg umr
error has cleared, but UCX_TLS=all UCX_RC_PATH_MTU=1024
still produces a synd 0x14
error:
[1595631518.363686] [sh03-01n02:14791:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(self/memory cma/memory);
[1595631518.364094] [sh03-01n02:14791:0] ucp_worker.c:1543 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory);
[1595631518.403951] [sh03-01n02:14791:0] ucp_worker.c:1543 UCX INFO ep_cfg[3]: tag(dc_mlx5/mlx5_0:1);
[sh03-01n02:14791:0:14791] ib_mlx5_log.c:139 Remote QP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[sh03-01n02:14791:0:14791] ib_mlx5_log.c:139 DCI QP 0x1ca1 wqe[6]: RDMA_READ s-- [rqpn 0x20fc rlid 93] [rva 0x7fb33a733188 rkey 0x6029] [va 0x7fbdea5c1388 len 632 lkey 0x41e6]
==== backtrace (tid: 14791) ====
0 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x1ac) [0x7fbfd402b00c]
1 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x7fbfd4027f30]
2 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(+0x28ef8) [0x7fbfd402bef8]
3 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7fbfd402c02c]
4 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x2a9) [0x7fbfcf12fff9]
5 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x5e1c1) [0x7fbfcf16e1c1]
6 /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fbfd495450a]
UCX_TLS=rc UCX_RC_PATH_MTU=1024
still crashes with a "Local protection" error, no "reg umr failed" error in dmesg
:
[1595631666.617742] [sh03-01n03:15614:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(rc_mlx5/mlx5_0:1);
[sh03-01n03:15614:0:15614] ib_mlx5_log.c:139 Local protection on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/4)
[sh03-01n03:15614:0:15614] ib_mlx5_log.c:139 RC QP 0x2fac wqe[0]: SEND --e [va 0x7f42e15f9500 len 34 lkey 0xd67cf]
==== backtrace (tid: 15614) ====
0 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x1ac) [0x7f42ea02a00c]
1 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x7f42ea026f30]
2 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(+0x28ef8) [0x7f42ea02aef8]
3 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7f42ea02b02c]
4 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x2a9) [0x7f42e9131ff9]
5 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x3d207) [0x7f42e914f207]
6 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x709) [0x7f42e9150f69]
7 /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f42ea95350a]
Interestingly enough, that issue seems to only arise on HDR HCAs (MT4123, fw 20.27.6008)
If I run the exact same command (with UCX_TLS=rc
, same software, same distribution, same MOFED 5.0) on nodes with EDR cards (MT4115, fw 12.25.1020), there's no error.
So it looks like this could be something specific to HDR cards?
@kcgthb can you try with UCX_TLS=rc_v
?
@yosefe unfortunately, different trace with UCX_TLS=rc_v
, but it still crashes. It gives the following backtrace for each MPI rank:
[1595872567.454712] [sh03-01n69:22026:0] ucp_worker.c:1543 UCX INFO ep_cfg[1]: tag(rc_verbs/mlx5_0:1);
[1595872567.532437] [sh03-01n69:22026:0] uct_iface.c:67 UCX WARN got active message id 0, but no handler installed
[1595872567.532445] [sh03-01n69:22026:0] uct_iface.c:70 UCX WARN payload 32 of 32 bytes:
[1595872567.532445] [sh03-01n69:22026:0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1595872567.532445] [sh03-01n69:22026:0] uct_iface.c:70 UCX WARN 00000000:00000000:00000000:00000000
[1595872567.532724] [sh03-01n69:22026:0] log.c:349 UCX WARN ==== backtrace (tid: 22026) ====
[1595872567.532729] [sh03-01n69:22026:0] log.c:352 UCX WARN 0 /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_print_backtrace+0x27) [0x7efd9b7a5687]
[1595872567.532731] [sh03-01n69:22026:0] log.c:352 UCX WARN 1 /share/software/user/open/ucx/1.8.1/lib/libuct.so.0(+0x104e2) [0x7efda005f4e2]
[1595872567.532732] [sh03-01n69:22026:0] log.c:352 UCX WARN 2 /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x2b8d9) [0x7efd9aabb8d9]
[1595872567.532734] [sh03-01n69:22026:0] log.c:352 UCX WARN 3 /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7efda02af50a]
[1595872567.532735] [sh03-01n69:22026:0] log.c:352 UCX WARN 4
I'm starting to see a pattern here, as it looks like the "local protection" errors seem to be node-specific, ie. the exact same command with UCX_TLC=rc
crashes on some node pairs, but works fine on others.
As far as I can tell, the node hardware, firmware and software configurations are all 100% identical: same OS and MLNX OFED version, provisioned the same way on all nodes, same BIOS settings, same IB HCA firmware on all the nodes.
Going on with more tests, I noted that rebooting a node that was causing those issues was sufficient to make it work again with all UCX transports.
Are the "local protection" errors an indication of some hardware or firmware-level state corruption happening, that would be cleared by a reboot?
@kcgthb any chance you are using AMD CPU with IOMMU enabled?
can you try to set iommu=pt
for the kernel boot command?
Those are AMD CPUs indeed (EPYC gen 2) and IOMMU support is enabled in the BIOS.
I'll try to reboot them with iommu=pt
and will report back.
Thanks!
@yosefe iommu=pt
seems to have done the trick, indeed!
I can't reproduce the issue on nodes with that kernel option anymore, and it seems to even have improved performance at the same time.
Thank you so much for figuring it out!
Hi!
I'm getting ABRT errors when trying to run simple HPL tests across a couple nodes with UCX 1.8.x (tested 1.8.0 and 1.8.1).
With
UCX_TLS=ud
, everything works well, but using eitherUCX_TLS=rc
orUCX_TLS=dc
(and by extension,UCX_TLS=all
), I get this for the exact same command:I'm not too sure how to diagnose
Remote QP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
, so any insight would be very much appreciated.Thanks!