openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org
Other
1.15k stars 427 forks source link

ABRT / synd 0x14 with some transports #5472

Open kcgthb opened 4 years ago

kcgthb commented 4 years ago

Hi!

I'm getting ABRT errors when trying to run simple HPL tests across a couple nodes with UCX 1.8.x (tested 1.8.0 and 1.8.1).

With UCX_TLS=ud, everything works well, but using either UCX_TLS=rc or UCX_TLS=dc (and by extension, UCX_TLS=all), I get this for the exact same command:

[sh03-01n01:26318:0:26318] ib_mlx5_log.c:139  Remote QP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[sh03-01n01:26318:0:26318] ib_mlx5_log.c:139  DCI QP 0x18162 wqe[41]: RDMA_READ s-- [rqpn 0x86f1 rlid 99] [rva 0x7f0c6162b688 rkey 0x31633] [va 0x7f7935777c88 len 3960 lkey 0x38e3f]
==== backtrace (tid:  26318) ====
 0  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x1ac) [0x7f79af9b000c]
 1  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x7f79af9acf30]
 2  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(+0x28ef8) [0x7f79af9b0ef8]
 3  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7f79af9b102c]
 4  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x2a9) [0x7f79aecc3ff9]
 5  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x5e1c1) [0x7f79aed021c1]
 6  /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f79b44c250a]

I'm not too sure how to diagnose Remote QP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0), so any insight would be very much appreciated.

Thanks!

yosefe commented 4 years ago

Can you try adding UCX_RC_PATH_MTU=1024 ? Currently, UCX does not support different MTU values.

kcgthb commented 4 years ago

Hi @yosefe Thanks a lot for the quick response!

Unfortunately, setting UCX_RC_PATH_MTU=1024 still fails, albeit with different errors:

I played with different values of UCX_RC_PATH_MTU, but they all produce the same result.

What does " Local protection" mean?

Thanks!

yosefe commented 4 years ago

@kcgthb it seems there is some sort of data corruption / access to invalid memory address

kcgthb commented 4 years ago

Right: one of the host logged plenty of those in dmesg:

[72970.271488] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7726): reg umr failed (5)
[72970.271504] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7719): reg umr failed (5)
[72970.271523] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7705): reg umr failed (5)
[72970.271532] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7715): reg umr failed (5)
[72970.271535] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7711): reg umr failed (5)
[72970.271553] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7717): reg umr failed (5)
[72970.272232] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7707): reg umr failed (5)
[72970.272719] infiniband mlx5_0: mlx5_ib_post_send_wait:819:(pid 7709): reg umr failed (5)
kcgthb commented 4 years ago

After a reboot, the reg umr error has cleared, but UCX_TLS=all UCX_RC_PATH_MTU=1024 still produces a synd 0x14 error:

[1595631518.363686] [sh03-01n02:14791:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(self/memory cma/memory); 
[1595631518.364094] [sh03-01n02:14791:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[2]: tag(posix/memory cma/memory); 
[1595631518.403951] [sh03-01n02:14791:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[3]: tag(dc_mlx5/mlx5_0:1); 
[sh03-01n02:14791:0:14791] ib_mlx5_log.c:139  Remote QP on mlx5_0:1/IB (synd 0x14 vend 0x89 hw_synd 0/0)
[sh03-01n02:14791:0:14791] ib_mlx5_log.c:139  DCI QP 0x1ca1 wqe[6]: RDMA_READ s-- [rqpn 0x20fc rlid 93] [rva 0x7fb33a733188 rkey 0x6029] [va 0x7fbdea5c1388 len 632 lkey 0x41e6]
==== backtrace (tid:  14791) ====
 0  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x1ac) [0x7fbfd402b00c]
 1  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x7fbfd4027f30]
 2  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(+0x28ef8) [0x7fbfd402bef8]
 3  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7fbfd402c02c]
 4  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x2a9) [0x7fbfcf12fff9]
 5  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x5e1c1) [0x7fbfcf16e1c1]
 6  /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7fbfd495450a]

UCX_TLS=rc UCX_RC_PATH_MTU=1024 still crashes with a "Local protection" error, no "reg umr failed" error in dmesg:

[1595631666.617742] [sh03-01n03:15614:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(rc_mlx5/mlx5_0:1); 
[sh03-01n03:15614:0:15614] ib_mlx5_log.c:139  Local protection on mlx5_0:1/IB (synd 0x4 vend 0x51 hw_synd 0/4)
[sh03-01n03:15614:0:15614] ib_mlx5_log.c:139  RC QP 0x2fac wqe[0]: SEND --e [va 0x7f42e15f9500 len 34 lkey 0xd67cf]
==== backtrace (tid:  15614) ====
 0  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x1ac) [0x7f42ea02a00c]
 1  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_fatal_error_message+0x50) [0x7f42ea026f30]
 2  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(+0x28ef8) [0x7f42ea02aef8]
 3  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_dispatch+0xcc) [0x7f42ea02b02c]
 4  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x2a9) [0x7f42e9131ff9]
 5  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x3d207) [0x7f42e914f207]
 6  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(uct_rc_mlx5_iface_progress+0x709) [0x7f42e9150f69]
 7  /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7f42ea95350a]

Interestingly enough, that issue seems to only arise on HDR HCAs (MT4123, fw 20.27.6008) If I run the exact same command (with UCX_TLS=rc, same software, same distribution, same MOFED 5.0) on nodes with EDR cards (MT4115, fw 12.25.1020), there's no error.

So it looks like this could be something specific to HDR cards?

yosefe commented 4 years ago

@kcgthb can you try with UCX_TLS=rc_v ?

kcgthb commented 4 years ago

@yosefe unfortunately, different trace with UCX_TLS=rc_v, but it still crashes. It gives the following backtrace for each MPI rank:

[1595872567.454712] [sh03-01n69:22026:0]     ucp_worker.c:1543 UCX  INFO  ep_cfg[1]: tag(rc_verbs/mlx5_0:1); 
[1595872567.532437] [sh03-01n69:22026:0]      uct_iface.c:67   UCX  WARN  got active message id 0, but no handler installed
[1595872567.532445] [sh03-01n69:22026:0]      uct_iface.c:70   UCX  WARN  payload 32 of 32 bytes:
[1595872567.532445] [sh03-01n69:22026:0]      uct_iface.c:70   UCX  WARN  00000000:00000000:00000000:00000000
[1595872567.532445] [sh03-01n69:22026:0]      uct_iface.c:70   UCX  WARN  00000000:00000000:00000000:00000000
[1595872567.532724] [sh03-01n69:22026:0]            log.c:349  UCX  WARN  ==== backtrace (tid:  22026) ====
[1595872567.532729] [sh03-01n69:22026:0]            log.c:352  UCX  WARN   0  /share/software/user/open/ucx/1.8.1/lib/libucs.so.0(ucs_log_print_backtrace+0x27) [0x7efd9b7a5687]
[1595872567.532731] [sh03-01n69:22026:0]            log.c:352  UCX  WARN   1  /share/software/user/open/ucx/1.8.1/lib/libuct.so.0(+0x104e2) [0x7efda005f4e2]
[1595872567.532732] [sh03-01n69:22026:0]            log.c:352  UCX  WARN   2  /share/software/user/open/ucx/1.8.1/lib/ucx/libuct_ib.so.0(+0x2b8d9) [0x7efd9aabb8d9]
[1595872567.532734] [sh03-01n69:22026:0]            log.c:352  UCX  WARN   3  /share/software/user/open/ucx/1.8.1/lib/libucp.so.0(ucp_worker_progress+0x6a) [0x7efda02af50a]
[1595872567.532735] [sh03-01n69:22026:0]            log.c:352  UCX  WARN   4 
kcgthb commented 4 years ago

I'm starting to see a pattern here, as it looks like the "local protection" errors seem to be node-specific, ie. the exact same command with UCX_TLC=rc crashes on some node pairs, but works fine on others.

As far as I can tell, the node hardware, firmware and software configurations are all 100% identical: same OS and MLNX OFED version, provisioned the same way on all nodes, same BIOS settings, same IB HCA firmware on all the nodes.

Going on with more tests, I noted that rebooting a node that was causing those issues was sufficient to make it work again with all UCX transports.

Are the "local protection" errors an indication of some hardware or firmware-level state corruption happening, that would be cleared by a reboot?

yosefe commented 4 years ago

@kcgthb any chance you are using AMD CPU with IOMMU enabled? can you try to set iommu=pt for the kernel boot command?

kcgthb commented 4 years ago

Those are AMD CPUs indeed (EPYC gen 2) and IOMMU support is enabled in the BIOS. I'll try to reboot them with iommu=pt and will report back. Thanks!

kcgthb commented 4 years ago

@yosefe iommu=pt seems to have done the trick, indeed!

I can't reproduce the issue on nodes with that kernel option anymore, and it seems to even have improved performance at the same time.

Thank you so much for figuring it out!