Open karanveersingh5623 opened 2 years ago
@yosefe @Artemy-Mellanox
[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun --mca pml ucx -np 2 -hostfile hostfile ./osu_bw D H
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node001
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened. This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded). Note that
Open MPI stopped checking at the first component that it did not find.
Host: node001
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
mca_pml_base_open() failed
--> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:3055474] *** An error occurred in MPI_Init
[node001:3055474] *** reported by process [4059693057,0]
[node001:3055474] *** on a NULL communicator
[node001:3055474] *** Unknown error
[node001:3055474] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:3055474] *** and potentially your MPI job)
[bright88:3788919] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[bright88:3788919] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[bright88:3788919] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[bright88:3788919] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[bright88:3788919] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
@karanveersingh5623 perhaps Open MPI was compiled without UCX or cannot find it. What is the output of ompi_info -a | grep ucx?
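For reference, a minimal sketch of that check, assuming the same install prefix as in the mpirun command above (the expected entries are illustrative, not the exact output):

/cm/shared/apps/openmpi4/gcc/4.1.2/bin/ompi_info -a | grep ucx
# A UCX-enabled build typically lists entries such as:
#   MCA osc: ucx (...)
#   MCA pml: ucx (...)
# If nothing is printed, this Open MPI build most likely lacks UCX support.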
This is an amazing debugging process, and I have learned a lot from it. Thank you!
Describe the bug
Trying to run a Docker container for data preprocessing; it is the MLPerf CosmoFlow NVIDIA implementation, link below. The MPI process, which runs a shell script inside the Docker container, works fine for the training folder but fails for the validation folder. Script details are below [init_datasets.sh]. Please let me know if you need more info.
Error message when running the container using srun:
Steps to Reproduce
- Command line
- UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
- Any UCX environment variables used (commands to collect these are sketched below)
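A minimal way to collect the UCX details requested above, assuming ucx_info is on the PATH:

ucx_info -v        # prints the UCX version and configure flags
env | grep ^UCX    # lists any UCX_* environment variables that are set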
Setup and versions
- OS version (e.g. Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...):
  cat /etc/issue or cat /etc/redhat-release + uname -a
  [root@bright88 burst-buffer]# cat /etc/centos-release
  CentOS Linux release 7.9.2009 (Core)
  [root@bright88 burst-buffer]# uname -r
  3.10.0-1160.11.1.el7.x86_64
- For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
- For RDMA/IB/RoCE related issues: driver version (rpm -q rdma-core or rpm -q libibverbs), or MLNX_OFED version (ofed_info -s); HW information from the ibstat or ibv_devinfo -vv command
  Just using Mellanox ConnectX-5, TCP stack on all servers
- GPU related: lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@bright88 ~]# mpiexec --version
mpiexec (OpenRTE) 4.1.2
Report bugs to http://www.open-mpi.org/community/help/
[root@bright88 ~]# mpirun --version
mpirun (Open MPI) 4.1.2
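If ompi_info shows no ucx components, a hedged sketch of rebuilding Open MPI 4.1.x with UCX support (the prefix is taken from the mpirun path above; the UCX install location /usr is an assumption for this setup):

./configure --prefix=/cm/shared/apps/openmpi4/gcc/4.1.2 --with-ucx=/usr
make -j && make install
# then re-check with: ompi_info -a | grep ucx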