openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported #8440

Open karanveersingh5623 opened 2 years ago

karanveersingh5623 commented 2 years ago

Describe the bug

I am trying to run a Docker container for data preprocessing; it is the MLPerf CosmoFlow NVIDIA implementation (the link is below). The MPI process runs a shell script inside the container; it works fine for the training folder but fails for the validation folder. The script details are below [init_datasets.sh]. Please let me know if you need more info.

#!/bin/bash

DATA_SRC_DIR="/mnt/cosmoUniverse_2019_05_4parE_tf_small"
DATA_DST_DIR="/mnt/processed"

python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip

ls -1 ${DATA_DST_DIR}/train | grep "_data.npy" | sort > ${DATA_DST_DIR}/train/files_data.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_data.npy" | sort > ${DATA_DST_DIR}/validation/files_data.lst
ls -1 ${DATA_DST_DIR}/train | grep "_label.npy" | sort > ${DATA_DST_DIR}/train/files_label.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_label.npy" | sort > ${DATA_DST_DIR}/validation/files_label.lst

Error message when running the container using srun:

[root@bright88 burst-buffer]# srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3

[root@bright88 burst-buffer]# ENROOT_ALLOW_HTTP=yes srun --mpi=pmix_v3 -N 1 -G 1 --ntasks=1 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh

[node001:63628] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1660021420.383880] [node001:63628:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:63628] pml_ucx.c:309  Error: Failed to create UCP worker
2022-08-09 14:03:40.430267: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-09 14:03:41.021414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-08-09 14:03:41.593379: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:41.680931: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
[... the same record_reader.cc:50 "Unsupported compression_type:gzip" message repeats for the remaining input files ...]
Found 32 files, 0 are done, 32 are remaining.
[node001:64037] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:64037] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:64037] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node001:64037] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
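
Not part of the original report, but a hedged sketch of checks commonly suggested for the two failures above: the mlx5dv_devx_create_event_channel() "Protocol not supported" error often indicates that the DEVX interface is not usable from inside the container (RDMA devices not passed through, or the kernel/rdma-core stack is too old), and the PMIx errors point to a mismatch between the container's Open MPI and the PMIx that srun provides. UCX_MLX5_DEVX, UCX_TLS and UCX_LOG_LEVEL are standard UCX tunables; the srun flags below simply mirror the command used earlier and may need adjusting for this cluster.

# 1) Check that the RDMA devices are visible inside the container;
#    without them the DEVX event channel cannot be created.
srun -N 1 -w node001 --container-name=cosmoflow-preprocess ls /dev/infiniband

# 2) Ask UCX not to use DEVX objects (fall back to regular verbs),
#    or restrict UCX to TCP to rule out the RDMA stack entirely.
export UCX_MLX5_DEVX=n      # assumption: default is "try"
# export UCX_TLS=tcp        # debugging-only fallback
export UCX_LOG_LEVEL=info   # more detail if the error persists

# 3) For the PMIx errors, compare the PMIx the container's Open MPI was
#    built against with what Slurm offers (see "srun --mpi=list" above).
ompi_info | grep -i pmix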

Steps to Reproduce

Setup and versions

+-----------------------------------------------------------------------------+
| Processes:                                                                   |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory  |
|        ID   ID                                                   Usage       |
|==============================================================================|
|  No running processes found                                                  |
+-----------------------------------------------------------------------------+


Additional information (depending on the issue)
- OpenMPI version:

[root@bright88 ~]# mpiexec --version
mpiexec (OpenRTE) 4.1.2

Report bugs to http://www.open-mpi.org/community/help/
[root@bright88 ~]# mpirun --version
mpirun (Open MPI) 4.1.2


- Output of `ucx_info -d` to show transports and devices recognized by UCX
- Configure result - config.log
- Log file - configure UCX with "--enable-logging" and run with "UCX_LOG_LEVEL=data" (commands sketched below)
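
These items were left unfilled in the report; a minimal sketch of the standard commands that would produce them, run inside the same container and assuming ucx_info is on the PATH:

ucx_info -d                 # transports and devices recognized by UCX
ucx_info -v                 # UCX version and the configure line it was built with
# For the verbose log, rerun the failing script with data-level logging:
UCX_LOG_LEVEL=data bash /mnt/mxnet/tools/init_datasets.sh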
karanveersingh5623 commented 1 year ago

@yosefe @Artemy-Mellanox

[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun --mca pml ucx -np 2 -hostfile hostfile ./osu_bw D H
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           node001
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:      node001
Framework: pml
Component: ucx
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  mca_pml_base_open() failed
  --> Returned "Not found" (-13) instead of "Success" (0)
--------------------------------------------------------------------------
[node001:3055474] *** An error occurred in MPI_Init
[node001:3055474] *** reported by process [4059693057,0]
[node001:3055474] *** on a NULL communicator
[node001:3055474] *** Unknown error
[node001:3055474] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node001:3055474] ***    and potentially your MPI job)
[bright88:3788919] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[bright88:3788919] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[bright88:3788919] 1 more process has sent help message help-mca-base.txt / find-available:not-valid
[bright88:3788919] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[bright88:3788919] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
yosefe commented 1 year ago

@karanveersingh5623 perhaps OpenMPI was compiled without UCX or cannot find it, what is the output of ompi_info -a|grep ucx ?
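
For reference (not part of the original comment), a minimal sketch of that check and of what a UCX-enabled Open MPI build would typically report; the version strings in the comments are illustrative only:

ompi_info -a | grep -i ucx
# A build with UCX support normally lists the pml/ucx (and osc/ucx) components
# along with their MCA parameters, e.g.:
#   MCA pml: ucx (MCA v2.1.0, API v2.0.0, Component v4.1.2)
# If nothing is printed, Open MPI was built without UCX or cannot load it.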

sumuzhe317 commented 1 year ago

This is an amazing debugging process, and I have learned a lot from it. Thank you!