openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

mlx5 connect on mlx5_1 failed: Connection timed out #9971

Open shinoharakazuya opened 1 week ago

shinoharakazuya commented 1 week ago

Describe the bug

I'm running the HPL benchmark from NGC under Slurm. When I ran HPL in the HPL container on two servers with 8 GPUs per node, I hit the UCX error shown in the title (mlx5 connect on mlx5_1 failed: Connection timed out).

Steps to Reproduce

Setup and versions

Additional information (depending on the issue)

shinoharakazuya commented 1 week ago

logfile.txt

yosefe commented 6 days ago

@shinoharakazuya can you please post the output of the show_gids command, and check whether setting UCX_IB_ROCE_LOCAL_SUBNET=y resolves the issue?
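
For reference, the two checks requested above could be scripted roughly as follows. This is only a sketch under assumptions: the helper names `ucx_env_with_roce_local_subnet` and `collect_show_gids` are made up for illustration, `show_gids` is assumed to be on `PATH` on the compute node, and the actual Slurm/HPL launch command is not shown in this thread, so it is left as a comment.

```python
import os
import shutil
import subprocess


def ucx_env_with_roce_local_subnet(base_env=None):
    """Return a copy of the environment with UCX_IB_ROCE_LOCAL_SUBNET=y set.

    Hypothetical helper: it only prepares the environment, it does not launch HPL.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_IB_ROCE_LOCAL_SUBNET"] = "y"
    return env


def collect_show_gids():
    """Return the output of show_gids, or None if the tool is not on PATH."""
    tool = shutil.which("show_gids")
    if tool is None:
        return None
    result = subprocess.run([tool], capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    gids = collect_show_gids()
    print(gids if gids is not None else "show_gids not found on this node")

    env = ucx_env_with_roce_local_subnet()
    # Assumption: pass `env` to whatever command launches the job, e.g.
    # subprocess.run(<your launch command>, env=env).
    print("UCX_IB_ROCE_LOCAL_SUBNET =", env["UCX_IB_ROCE_LOCAL_SUBNET"])
```

Running this on each of the two nodes would give the GID table for every port, which can then be attached to the issue alongside the rerun result.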

changchengx commented 5 days ago

@jandres742 FYI

yosefe commented 5 days ago

NOTE: This issue happens on an NVIDIA internal cluster.