@open-mpi/ucx FYI
@robertsawko Which IB Device is present on your system?
Thanks for responding so quickly!
Is this what you are asking?
ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.24.1000
    Hardware version: 0
    Node GUID: 0x248a07030091fde0
    System image GUID: 0x248a07030091fde0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 100
        Base lid: 53
        LMC: 0
        SM lid: 17
        Capability mask: 0x2651e848
        Port GUID: 0x248a07030091fde0
        Link layer: InfiniBand
ibv_devinfo
hca_id: mlx5_0
    transport:       InfiniBand (0)
    fw_ver:          12.24.1000
    node_guid:       248a:0703:0091:fde0
    sys_image_guid:  248a:0703:0091:fde0
    vendor_id:       0x02c9
    vendor_part_id:  4115
    hw_ver:          0x0
    board_id:        MT_2180110032
    phys_port_cnt:   1
    Device ports:
        port: 1
            state:       PORT_ACTIVE (4)
            max_mtu:     4096 (5)
            active_mtu:  4096 (5)
            sm_lid:      17
            port_lid:    53
            port_lmc:    0x00
            link_layer:  InfiniBand
Thanks. What happens if you specify -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1?
What do you mean by random freezes? Does it happen sporadically, or does it simply not get past the first send/recv message? If the above runtime parameters don't help, what is the backtrace on both ranks when it freezes?
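If it does hang, one quick way to grab those backtraces is to attach a debugger to each stuck rank. A minimal sketch, assuming gdb is available on the compute nodes and <PID> stands for the process id of a hung rank:

    # attach non-interactively, dump backtraces of all threads, then detach
    gdb -p <PID> -batch -ex "thread apply all bt"

Repeating this on both ranks usually shows where each side is blocked.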
@janjust is right: according to your logs, the UCX PML disqualifies itself because the list of transports is empty.
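To cross-check what UCX itself detects on that node, you can query it directly. A sketch, assuming the ucx_info utility from the same UCX installation is on the PATH:

    # list the transports and devices UCX can see; an empty or tcp-only list
    # would explain why the UCX PML disqualifies itself
    ucx_info -d | grep -E "Transport|Device"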
@robertsawko what is the output of ls -l /sys/class/infiniband/mlx5_0/device/driver?
Also, can you please try with the latest v4.1.x branch? Perhaps f38878e9ce7e4c164a392258d1505544a57666a2 fixes the issue.
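For reference, one way to pick up that branch is sketched below; the clone location, install prefix, and UCX path are assumptions, and a git build additionally needs the usual autotools:

    # fetch the v4.1.x release branch and confirm the fix is included
    git clone --branch v4.1.x https://github.com/open-mpi/ompi.git
    cd ompi
    git merge-base --is-ancestor f38878e9ce7e4c164a392258d1505544a57666a2 HEAD && echo "fix present"
    # rebuild roughly as before
    ./autogen.pl
    ./configure --prefix=$HOME/opt/openmpi-4.1.x --with-ucx=$HOME/opt/ucx-1.12.1
    make -j install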
Hi! Thanks again to everyone for their commitment, and for responding over the weekend too.
@yosefe
ls -l /sys/class/infiniband/mlx5_0/device/driver
lrwxrwxrwx. 1 root root 0 May 8 22:08 /sys/class/infiniband/mlx5_0/device/driver -> ../../../../bus/pci/drivers/mlx5_core
Also, I am using the 4.1.1 stable release, but I am happy to recompile with the commit you specified.
@janjust you are right, when I specify:
mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 Sendrecv
the benchmark runs like a sprint runner on the last 10 m of the final day of an Olympic competition, with a fighting chance of breaking a world record... Sorry. So is that something I need to specify? Maybe include it in the Lmod file? And why is that list empty?
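If it does need to be set permanently, one option is to export the equivalent environment variables from the module (in Lmod, via setenv calls in the modulefile) or from a shell profile. A minimal sketch, assuming the mlx5_0:1 device identified above:

    # equivalent of "-mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1", set via the environment
    export OMPI_MCA_pml=ucx
    export UCX_NET_DEVICES=mlx5_0:1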
@yosefe, I can confirm that the problem is indeed fixed with 4.1.x: I no longer need to specify the variable, and the Sendrecv benchmark produces the numbers I expect from our InfiniBand. Many thanks for pointing this out.
Hello, I would appreciate some advice on the following issue.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OpenMPI 4.1.1 and UCX 1.12.1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
OpenMPI and UCX were installed from source.
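A minimal sketch of a typical source build of this pair is shown below; the prefixes and flags are assumptions, not necessarily the exact ones used here:

    # build UCX 1.12.1 first
    cd ucx-1.12.1
    ./configure --prefix=$HOME/opt/ucx-1.12.1
    make -j && make install
    # then build Open MPI 4.1.1 against it
    cd ../openmpi-4.1.1
    ./configure --prefix=$HOME/opt/openmpi-4.1.1 --with-ucx=$HOME/opt/ucx-1.12.1
    make -j && make install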
Please describe the system on which you are running
Details of the problem
I am having issues at the MPI initialisation stage. As a sanity check I started running the Intel MPI Benchmark (IMB).
The code simply freezes when we reach the actual benchmark. Forcing TCP makes it work, which makes me think it's either a hardware problem or still some issue in my setup.
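One common way to force the TCP path in Open MPI 4.x, for comparison (the exact flags used here may have differed), is:

    # bypass UCX/InfiniBand and run over the TCP BTL via the ob1 PML
    mpirun -np 2 -mca pml ob1 -mca btl tcp,self ./IMB-MPI1 Sendrecv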
I've used OMPI_MCA_pml_ucx_verbose=100, following a similar problem I was also having before, and here is the output for just two processes: