Open shinoharakazuya opened 1 week ago
@shinoharakazuya can you pls post the output of the show_gids command, and check if setting UCX_IB_ROCE_LOCAL_SUBNET=y helps to resolve the issue?
@jandres742 FYI
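For reference, a minimal sketch of how the suggestion above could be tried from the shell that submits the job. It assumes the environment is propagated to the job's tasks (Slurm's default behaviour); the srun line in the final comment is a hypothetical placeholder, not the reporter's actual command.

```shell
# Sketch only: the launch command at the bottom is a placeholder.

# Print the GID table if the show_gids script is installed on this node.
if command -v show_gids >/dev/null 2>&1; then
    show_gids
else
    echo "show_gids not found on this node"
fi

# Enable the suggested setting for everything started from this shell.
export UCX_IB_ROCE_LOCAL_SUBNET=y

# Check that UCX sees the value (ucx_info -c prints the active configuration).
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -c | grep ROCE_LOCAL_SUBNET || echo "variable not listed by this UCX build"
fi

# Then relaunch the benchmark as before, for example (hypothetical):
# srun -N 2 --ntasks-per-node=8 your-hpl-launch-command
```

If the error disappears with the variable set, that points at RoCE subnet reachability detection rather than at the benchmark itself.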
NOTE: This issue happens on Nvidia internal cluster
Describe the bug
I'm running NGC's hpl benchmark test from Slurm. When I ran hpl in an hpl container on two servers with 8 GPUs per node, I encountered a UCX error.
Steps to Reproduce
ucx_info -v: Please see log file.
Setup and versions
cat /etc/issue or cat /etc/redhat-release + uname -a
cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
ibstat or ibv_devinfo -vv command
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv: Please see log file.
Additional information (depending on the issue)
ucx_info -d to show transports and devices recognized by UCX: Please see log file.
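The commands named in the sections above can be collected into a single file in one pass. The following is a minimal sketch, not the script used to produce the attached log: the output file name ucx_diag.log and the helper run_diag are hypothetical names chosen for this example, and tools that are not installed are recorded as missing instead of aborting the run.

```shell
# Sketch: collect the diagnostics named in the issue template into one log.
# OUT and run_diag are hypothetical names chosen for this example.
OUT="${OUT:-ucx_diag.log}"
: > "$OUT"

run_diag() {
    # Write a header, then the command output, or note that the tool is missing.
    echo "===== $* =====" >> "$OUT"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$OUT" 2>&1 || echo "(exit status $?)" >> "$OUT"
    else
        echo "(not installed: $1)" >> "$OUT"
    fi
}

run_diag ucx_info -v
run_diag ucx_info -d
run_diag cat /etc/issue
run_diag cat /etc/redhat-release
run_diag uname -a
run_diag cat /etc/mlnx-release
run_diag rpm -q rdma-core
run_diag rpm -q libibverbs
run_diag ofed_info -s
run_diag ibstat
run_diag ibv_devinfo -vv
# Pipelines need a shell; grep exits non-zero when the module is not loaded.
run_diag sh -c 'lsmod | grep nv_peer_mem'
run_diag sh -c 'lsmod | grep gdrdrv'

echo "wrote $OUT"
```

Running it on each of the two nodes and attaching both files makes it easier to compare driver and device state between them.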