openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCX ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported #8440

Open karanveersingh5623 opened 2 years ago

karanveersingh5623 commented 2 years ago

Describe the bug


I am trying to run a Docker container for data preprocessing, using the NVIDIA implementation of MLPerf CosmoFlow. The MPI process runs a shell script inside the container. It runs fine for the training folder but fails for validation. The script (init_datasets.sh) is shown below. Please let me know if you need more info.

#!/bin/bash

DATA_SRC_DIR="/mnt/cosmoUniverse_2019_05_4parE_tf_small"
DATA_DST_DIR="/mnt/processed"

python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip

ls -1 ${DATA_DST_DIR}/train | grep "_data.npy" | sort > ${DATA_DST_DIR}/train/files_data.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_data.npy" | sort > ${DATA_DST_DIR}/validation/files_data.lst
ls -1 ${DATA_DST_DIR}/train | grep "_label.npy" | sort > ${DATA_DST_DIR}/train/files_label.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_label.npy" | sort > ${DATA_DST_DIR}/validation/files_label.lst

Error message when running the container with srun:

[root@bright88 burst-buffer]# srun --mpi=list
srun: MPI types are...
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
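
Since the plugin passed to `--mpi=` has to appear in this list, the check can be scripted. A minimal sketch, assuming a hypothetical helper named `has_mpi_plugin`; the sample text is copied from the output above, and on a real cluster it could be replaced with the captured output of `srun --mpi=list`:

```shell
#!/bin/bash
# Sample output copied from the report above.
mpi_list='srun: MPI types are...
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3'

# has_mpi_plugin (hypothetical helper): succeed if plugin $1 is listed in text $2.
has_mpi_plugin() {
  printf '%s\n' "$2" | grep -qxF "srun: $1"
}

if has_mpi_plugin pmix_v3 "$mpi_list"; then
  echo "pmix_v3 is available"
fi
```

This only confirms that Slurm offers the plugin. It says nothing about whether the Open MPI build inside the container was built with matching PMIx support.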

[root@bright88 burst-buffer]# ENROOT_ALLOW_HTTP=yes srun --mpi=pmix_v3 -N 1 -G 1 --ntasks=1 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh

[node001:63628] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1660021420.383880] [node001:63628:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:63628] pml_ucx.c:309  Error: Failed to create UCP worker
2022-08-09 14:03:40.430267: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-09 14:03:41.021414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-08-09 14:03:41.593379: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:41.680931: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:41.754010: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:41.823503: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:41.894452: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:41.965944: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.039227: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.110450: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.182458: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.254903: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.327541: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.399590: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.470821: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.541492: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.613329: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.685538: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.757489: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.829088: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.900878: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:42.974708: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.047052: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.117164: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.187375: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.259076: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.330464: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.401330: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.471475: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.542862: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.614219: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.685517: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.756623: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-09 14:03:43.828099: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
Found 32 files, 0 are done, 32 are remaining.
[node001:64037] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:64037] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:64037] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[node001:64037] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Steps to Reproduce

Setup and versions

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |


### Additional information (depending on the issue)
- OpenMPI version --> 

[root@bright88 ~]# mpiexec --version
mpiexec (OpenRTE) 4.1.2

Report bugs to http://www.open-mpi.org/community/help/
[root@bright88 ~]# mpirun --version
mpirun (Open MPI) 4.1.2


- Output of `ucx_info -d` to show transports and devices recognized by UCX
- Configure result - config.log
- Log file - configure UCX with "--enable-logging" - and run with "UCX_LOG_LEVEL=data"

karanveersingh5623 commented 2 years ago

@Artemy-Mellanox, if you could share the steps to fix this issue in the Docker container, it would be a great help.

yosefe commented 2 years ago

@karanveersingh5623 can you please add the following environment variables to init_datasets.sh and post the output:

export UCX_LOG_LEVEL=info
export UCX_IB_MLX5_DEVX=no
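
A minimal sketch of how the top of init_datasets.sh might look with these two settings, assuming they are placed before any command that initializes MPI or UCX. The final line is an optional addition (not part of the suggestion above) that echoes the effective `UCX_*` variables into the job log:

```shell
#!/bin/bash
# Settings suggested above; they must be exported so child processes inherit them.
export UCX_LOG_LEVEL=info     # print UCX messages at info level
export UCX_IB_MLX5_DEVX=no    # do not use the mlx5 DevX interface

# Optional: show which UCX_* variables child processes will see.
env | grep '^UCX_' | sort
```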
karanveersingh5623 commented 2 years ago

@yosefe, thanks for getting back to me. Please see the output below:

# ENROOT_ALLOW_HTTP=yes srun --mpi=pmix_v3 -N 1 -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh

[node001:113245] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:113246] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:113244] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:113243] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
2022-08-17 09:56:06.381232: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:06.384032: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:06.384105: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:06.389060: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:07.581630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-08-17 09:56:07.590577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2022-08-17 09:56:07.743845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 3, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:e3:00.0, compute capability: 8.0
2022-08-17 09:56:07.752427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 2, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2022-08-17 09:56:08.287462: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.295460: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.383547: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.400239: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.423831: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.453386: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.461215: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.484786: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.504828: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.533111: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.553933: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.582050: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.584392: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.620111: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.635133: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.659646: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.676282: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.693794: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.720174: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.728267: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.776139: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.776139: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.802449: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.815474: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.863527: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.881359: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.881356: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.906286: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.960325: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.978037: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.991543: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:09.040572: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
[1660697766.019921] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026438] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030515] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049244] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069389] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105421] [node001:113244:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105529] [node001:113244:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108265] [node001:113244:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.123882] [node001:113244:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.128259] [node001:113244:async]      ucp_worker.c:1956 UCX  INFO    ep_cfg[3]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203819] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209559] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213760] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233378] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253477] [node001:113244:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296460] [node001:113244:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300908] [node001:113244:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.317866] [node001:113244:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660697766.019923] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026411] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030528] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049067] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069472] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105457] [node001:113245:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105559] [node001:113245:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108321] [node001:113245:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.123627] [node001:113245:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203797] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209569] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213757] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233370] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253381] [node001:113245:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296388] [node001:113245:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300893] [node001:113245:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.319414] [node001:113245:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660697766.019920] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026445] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030500] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049696] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069474] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105492] [node001:113243:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105596] [node001:113243:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108324] [node001:113243:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.124026] [node001:113243:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203793] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209544] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213774] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233370] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253432] [node001:113243:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296362] [node001:113243:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300899] [node001:113243:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.320338] [node001:113243:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660697766.020258] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026422] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030521] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049056] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069449] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105392] [node001:113246:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105503] [node001:113246:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108265] [node001:113246:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.123514] [node001:113246:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203795] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209549] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213748] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233357] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253418] [node001:113246:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296425] [node001:113246:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300915] [node001:113246:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.320306] [node001:113246:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[node001:114865] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:114865] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:114865] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112

yosefe commented 2 years ago

@karanveersingh5623 does it mean the job managed to run with these parameters?

karanveersingh5623 commented 2 years ago

> @karanveersingh5623 does it mean the job managed to run with these parameters?

@yosefe, the first job (the training dataset) managed to run, but it fails when it moves on to the validation dataset. The parameters you mentioned only added extra trace lines.

python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip
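
Because both conversions run from the same script, logging each step's exit status would make it explicit in the job output which command failed. A minimal sketch, assuming a hypothetical helper named `run_step`; the `true` and `false` commands are stand-ins for the two `python3` invocations above:

```shell
#!/bin/bash
# run_step (hypothetical helper): run a command, report its exit status, pass it on.
run_step() {
  local name=$1
  shift
  local rc=0
  "$@" || rc=$?
  echo "step '${name}' exited with status ${rc}" >&2
  return "$rc"
}

# Stand-ins for the train and validation conversions.
train_rc=0
run_step train true || train_rc=$?

validation_rc=0
run_step validation false || validation_rc=$?
```

In the real script, each `python3 ... convert_tfrecord_to_numpy.py` line would take the place of `true` or `false`.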
yosefe commented 2 years ago

@karanveersingh5623 Is that other failure related to UCX? I don't see any more UCX-related errors in the output.

karanveersingh5623 commented 2 years ago

> @karanveersingh5623 Is that other failure related to UCX? I don't see any more UCX-related errors in the output.

Is the error below not UCX-related?

[node001:63628] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1660021420.383880] [node001:63628:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:63628] pml_ucx.c:309  Error: Failed to create UCP worker

yosefe commented 2 years ago

> Is the error below not UCX-related?

I don't see it in https://github.com/openucx/ucx/issues/8440#issuecomment-1217329624. Does it still happen after adding `export UCX_IB_MLX5_DEVX=no`?
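
One way to answer this from a saved job log is to count the lines UCX logged at ERROR level. A minimal sketch, assuming a hypothetical helper named `count_ucx_errors`; the two sample lines are copied from the logs in this thread, and a real run would pass the file that the srun output was redirected to:

```shell
#!/bin/bash
# count_ucx_errors (hypothetical helper): print how many lines in file $1
# were logged by UCX at ERROR level. grep -c prints 0 and exits non-zero
# when nothing matches, so the exit status is ignored.
count_ucx_errors() {
  grep -cE 'UCX +ERROR' "$1" || true
}

# Demo on two lines copied from the logs in this thread.
log=$(mktemp)
cat > "$log" <<'EOF'
[1660021420.383880] [node001:63628:0]    rc_mlx5_devx.c:99   UCX  ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[1660697766.105529] [node001:113244:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
EOF
errors=$(count_ucx_errors "$log")
echo "UCX errors: $errors"
rm -f "$log"
```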

karanveersingh5623 commented 2 years ago

@yosefe, oh OK, yes, those errors are gone now, but there is still an issue with communication within the job. I don't know what is causing the failure messages. If you could point me in some direction, it would be helpful.

Artemy-Mellanox commented 2 years ago

@karanveersingh5623 could you please run `ucx_info -v` and post the output, so we can be sure we aren't missing something?

karanveersingh5623 commented 2 years ago

@Artemy-Mellanox, below is the shell script I ran inside the docker container:

DATA_SRC_DIR="/mnt/cosmoUniverse_2019_05_4parE_tf_small"
DATA_DST_DIR="/mnt/processed"
export UCX_LOG_LEVEL=info
export UCX_IB_MLX5_DEVX=no

ucx_info -v

python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip

ls -1 ${DATA_DST_DIR}/train | grep "_data.npy" | sort > ${DATA_DST_DIR}/train/files_data.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_data.npy" | sort > ${DATA_DST_DIR}/validation/files_data.lst
ls -1 ${DATA_DST_DIR}/train | grep "_label.npy" | sort > ${DATA_DST_DIR}/train/files_label.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_label.npy" | sort > ${DATA_DST_DIR}/validation/files_label.lst

Below is the trace it generated. The train step finishes without issues, but the validation step fails. The shell script runs just these two python commands, and I am using a single host.

[root@bright88 burst-buffer]# ENROOT_ALLOW_HTTP=yes srun --mpi=pmix_v3 -N 1 -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh

# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt

[node001:41715] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:41717] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:41718] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:41716] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
2022-08-19 10:30:56.823437: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:56.823818: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:56.823995: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:56.824052: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:58.028348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 3, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:e3:00.0, compute capability: 8.0
2022-08-19 10:30:58.029114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-08-19 10:30:58.029785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 2, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2022-08-19 10:30:58.265219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory:  -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2022-08-19 10:30:58.862638: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.877999: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.884026: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.942857: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.958647: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.973721: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.979046: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.027725: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.033068: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.073127: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.074661: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.112304: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.120149: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.171104: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.171103: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.196387: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.211329: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.251460: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.265213: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.279228: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.293308: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.326552: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.342905: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.359793: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.374258: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.401641: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.419742: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.438224: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.454582: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.480291: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.496656: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.516348: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
[1660872656.478887] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.482134] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.485205] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.500491] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.509412] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.529735] [node001:41718:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.529846] [node001:41718:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556293] [node001:41718:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.572946] [node001:41718:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649737] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655384] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660118] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679927] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698743] [node001:41718:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738389] [node001:41718:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739800] [node001:41718:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.758164] [node001:41718:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.478888] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.482136] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.485145] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.500625] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.509407] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.529695] [node001:41715:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.529804] [node001:41715:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556233] [node001:41715:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.570917] [node001:41715:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649765] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655468] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660160] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679731] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698094] [node001:41715:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738323] [node001:41715:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739786] [node001:41715:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.758145] [node001:41715:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.504776] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.508595] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.511891] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.528701] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.537382] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.554993] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.555091] [node001:41716:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556214] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.570769] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649756] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655472] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660162] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679597] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698706] [node001:41716:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738430] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739808] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.759937] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.762306] [node001:41716:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[3]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.504772] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.508626] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.513745] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.528710] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.537378] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.554950] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.555051] [node001:41717:0]          parser.c:1893 UCX  INFO  UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556322] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.573022] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.575952] [node001:41717:async]      ucp_worker.c:1956 UCX  INFO    ep_cfg[3]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649745] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655512] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660165] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679618] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698098] [node001:41717:0]            sock.c:128  UCX  DIAG  failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738355] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739798] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.759921] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.762297] [node001:41717:0]      ucp_worker.c:1956 UCX  INFO    ep_cfg[3]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[node001:43337] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:43337] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:43337] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
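
The Open MPI message above points at a mismatch between the PMI flavor that srun offers and what the MPI library inside the container can use. A minimal sketch of how one might check both sides follows. It assumes the `srun --mpi=list` output format shown earlier in this thread and that `ompi_info` is on the PATH inside the container; the helper name `slurm_has_mpi_plugin` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 when the plugin named in $1 appears in
# `srun --mpi=list` style output read from stdin (one "srun: <name>" per line).
slurm_has_mpi_plugin() {
    sed 's/^srun:[[:space:]]*//' | grep -qxF "$1"
}

# Slurm side: is pmix_v3 offered at all? (The list may be printed on stderr.)
if command -v srun >/dev/null 2>&1; then
    if srun --mpi=list 2>&1 | slurm_has_mpi_plugin pmix_v3; then
        echo "Slurm offers pmix_v3"
    else
        echo "Slurm does not list pmix_v3"
    fi
fi

# Open MPI side (run this inside the container): list PMIx-related components.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i pmix || true
fi
```

This only reports what each side advertises; it does not explain why a process started later in the job (PID 43337 above) could not reach the PMIx server.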
Artemy-Mellanox commented 2 years ago

This issue was fixed in UCX version 1.12.x, so you may either upgrade or use export UCX_IB_MLX5_DEVX=no, which has the same effect.
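
For job scripts that may run against different UCX builds, the workaround can be applied conditionally. The following is only a sketch: it assumes `ucx_info -v` prints a line of the form `# UCT version=1.11.0 revision ...` as in the output posted above (other UCX releases may format this line differently, in which case nothing is exported), and the helper names `version_lt` and `parse_ucx_version` are made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: returns 0 when version $1 is strictly older than $2.
# Relies on GNU sort's -V (version sort) option.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Hypothetical helper: extracts "1.11.0" from a line such as
# "# UCT version=1.11.0 revision 6031c98" read from stdin.
parse_ucx_version() {
    sed -n 's/^.*version=\([0-9][0-9.]*\).*$/\1/p' | head -n 1
}

# Export the workaround only when ucx_info exists and reports a pre-1.12 UCX.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_version="$(ucx_info -v | parse_ucx_version)"
    if [ -n "$ucx_version" ] && version_lt "$ucx_version" "1.12.0"; then
        export UCX_IB_MLX5_DEVX=no
    fi
fi
```

With the 1.11.0 build shown earlier, this would export the variable; on 1.12.0 or later it would leave the environment untouched.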

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox @yosefe

Below is the issue I am facing; please check the trace:

[1667471376.875185] [node002:217310:0] select.c:513 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[node002:217310] pml_ucx.c:419 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[node002:217310] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 0

srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=2 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-03 07:29:27 PM
running benchmark
STARTING TIMING RUN AT 2022-11-03 07:29:27 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=16-31,80-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=0-15,64-79 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-03 07:29:35 PM
running benchmark
STARTING TIMING RUN AT 2022-11-03 07:29:35 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=22
num_sockets = 2 num_nodes=2 cores_per_socket=22
+ exec numactl --physcpubind=22-43,66-87 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
+ exec numactl --physcpubind=0-21,44-65 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
[1667471376.843507] [node002:217310:0]     ucp_context.c:780  UCX  WARN  network device 'mlx5_0:1' is not available, please use one or more of: 'eth4'(tcp), 'lo'(tcp)
[1667471376.868844] [node002:217310:0]          parser.c:1885 UCX  WARN  unused env variable: UCX_IB_MLX5_DEVX (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1667471376.875185] [node002:217310:0]          select.c:513  UCX  ERROR   no active messages transport to <no debug data>: Unsupported operation
[node002:217310] pml_ucx.c:419  Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[node002:217310] pml_ucx.c:472  Error: Failed to resolve UCX endpoint for rank 0
Error in MPI_Isend(52788624, 1, 0x1554426edce0, 0, -27, 23451636022496) (-1)
Error in NBC_Start_round() (-1)
Error in NBC_Start_round() (-1)
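
The first UCX warning in this trace already names the devices UCX can see on node002. A small sketch for pulling those names out of a saved log is shown here. It assumes the warning keeps the exact wording seen above; the function name `ucx_suggested_devices` is made up for illustration, and where the request for mlx5_0:1 comes from (for example a UCX_NET_DEVICES setting in the container or run script) is not shown in this thread.

```shell
#!/bin/bash
# Hypothetical helper: reads UCX log lines on stdin and prints, one per line,
# the device names suggested by "network device ... is not available" warnings.
ucx_suggested_devices() {
    grep "is not available, please use one or more of:" \
        | sed 's/^.*one or more of: //' \
        | grep -o "'[^']*'(" \
        | tr -d "'("
}
```

For the warning above, `ucx_suggested_devices < job.log` (with `job.log` being a placeholder file name) would list the two TCP interfaces UCX reports. Whether falling back to TCP is acceptable for this benchmark is a separate question; the absence of mlx5_0 on node002 may be the real thing to investigate.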
karanveersingh5623 commented 1 year ago

@yosefe @Artemy-Mellanox Below is another trace, this time using just 1 task per node.

srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-04 10:39:09 AM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32

Segmentation fault: 11

terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided
[node002:276417] Process received signal
[node002:276417] Signal: Aborted (6)
[node002:276417] Signal code: (-6)
[node002:276417] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x15555536a210]
[node002:276417] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x15555536a18b]
[node002:276417] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x155555349859]
[node002:276417] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x15550c6d0911]
[node002:276417] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x15550c6dc38c]
[node002:276417] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x15550c6db369]
[node002:276417] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x15550c6dbd21]
[node002:276417] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x155554e6dbef]
[node002:276417] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x331)[0x155554e6e281]
[node002:276417] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3c)[0x15550c6dc69c]
[node002:276417] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_system_errori+0x98)[0x15550c6d373f]
[node002:276417] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread6detachEv+0x0)[0x15550c709060]
[node002:276417] [12] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xb58)[0x1554427a97d8]
[node002:276417] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x49a27)[0x15555536da27]
[node002:276417] [14] /usr/lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x15555536dbe0]
[node002:276417] [15] /usr/local/lib/libmxnet.so(+0x17ee46f)[0x1554d4a7b46f]
[node002:276417] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x1555553163c0]
[node002:276417] [17] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_split_type+0xb1)[0x15544262de31]
[node002:276417] [18] /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Comm_split_type+0x2e)[0x15544266442e]
[node002:276417] [19] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10MPIContext10InitializeERKSt6vectorIiSaIiEERNS0_17MPIContextManagerE+0x17c)[0x1554427ecedc]
[node002:276417] [20] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x9d9ec)[0x1554427a39ec]
[node002:276417] [21] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x15550c708de4]
[node002:276417] [22] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x15555530a609]
[node002:276417] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x155555446293]
[node002:276417] End of error message
./run_and_time.sh: line 211: 276417 Aborted (core dumped) ${LOGGER:-} ${DISTRIBUTED} ${BIND} python train.py "${PARAMS[@]}"
slurmstepd: error: mpi/pmix_v3: _errhandler: node002 [1]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.46.0:1]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: STEP 46.0 ON node001 CANCELLED AT 2022-11-04T10:39:25
srun: error: node002: task 1: Killed
srun: error: node001: task 0: Killed

Artemy-Mellanox commented 1 year ago

Could you please run ucx_info -bdvc and ofed_info and attach the output here? Run both inside the container, like this:

srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ucx_info -bdvc
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ofed_info
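
The ucx_info -bdvc output is long, so it can help to reduce it to the transport/device pairs it reports before reading the rest. The following is only a minimal sketch: list_transports is a hypothetical helper name (not part of UCX), and it assumes the "#      Transport: ..." / "#         Device: ..." line layout that ucx_info -d prints.

```shell
#!/bin/bash
# Sketch: reduce `ucx_info -d` (or `ucx_info -bdvc`) output read from stdin
# to one "transport device" pair per line.
# list_transports is a hypothetical helper, not a UCX tool.
list_transports() {
    awk '
        # Remember the transport name from lines like "#      Transport: tcp"
        /^# +Transport:/ { tl = $3 }
        # Pair it with the next "#         Device: lo" line.
        # "#  System device: ..." lines do not match this pattern.
        /^# +Device:/ && tl != "" { print tl, $3; tl = "" }
    '
}

# Example usage (assumes ucx_info is on PATH inside the container):
#   ucx_info -d | list_transports
```

Piping the srun ... ucx_info -bdvc command above through this function would give one line per transport and device, which makes it easier to see at a glance whether any InfiniBand transports are listed alongside the shared-memory, TCP and CUDA ones.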
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

Please see the output below.

[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ucx_info -bdvc
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 1
#define ENABLE_PARAMS_CHECK       0
#define HAVE_1_ARG_BFD_SECTION_SIZE 0
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPU_SET_T            1
#define HAVE_CUDA                 1
#define HAVE_CUDA_H               1
#define HAVE_CUDA_RUNTIME_H       1
#define HAVE_DC_DV                1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 0
#define HAVE_DECL_BFD_SECTION_VMA 0
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_FUSE_MOUNT      0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT    0
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 1
#define HAVE_DECL_IBV_ADVISE_MR   1
#define HAVE_DECL_IBV_ALLOC_DM    1
#define HAVE_DECL_IBV_ALLOC_TD    1
#define HAVE_DECL_IBV_CMD_MODIFY_QP 0
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 0
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 0
#define HAVE_DECL_IBV_EXP_ALLOC_DM 0
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 0
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 0
#define HAVE_DECL_IBV_EXP_CREATE_QP 0
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 0
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 0
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 0
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 0
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 0
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 0
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_EXP_POST_SEND 0
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 0
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 0
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 0
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 0
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 0
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 0
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 0
#define HAVE_DECL_IBV_EXP_REG_MR  0
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 0
#define HAVE_DECL_IBV_EXP_SETENV  0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 0
#define HAVE_DECL_IBV_EXP_WR_NOP  0
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 1
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT    1
#define HAVE_DECL_IN_ATTRIB       1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 1
#define HAVE_DECL_MLX5DV_CREATE_QP 1
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 1
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 1
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 1
#define HAVE_DECL_MLX5DV_OBJ_AH   1
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_BF 0
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_NC 0
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DEVX                 1
#define HAVE_DLFCN_H              1
#define HAVE_GDRAPI_H             1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INFINIBAND_MLX5DV_H  1
#define HAVE_INFINIBAND_TM_TYPES_H 1
#define HAVE_INOTIFY              1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_HOOK          1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MEMALIGN             1
#define HAVE_MEMORY_H             1
#define HAVE_MLX5_HW              1
#define HAVE_MLX5_HW_UD           1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_DL_PHDR_INFO  1
#define HAVE_STRUCT_IBV_DEVICE_ATTR_EX_PCI_ATOMIC_CAPS 1
#define HAVE_STRUCT_IBV_TM_CAPS_FLAGS 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_DC                1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.11"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.11"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_DEBUG
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.11"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucg_MODULES               ":builtin"
#define ucm_MODULES               ":cuda"
#define ucs_MODULES               ""
#define uct_MODULES               ":cuda:ib:rdmacm:cma:xpmem"
#define uct_cuda_MODULES          ":gdrcopy"
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ":cuda"
#
# Memory domain: posix
#     Component: posix
#             allocate: unlimited
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: lo
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: eth3
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11316.36/ppn + 0.00 MB/sec
#              latency: 5206 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#
#      Transport: cuda_copy
#         Device: cuda
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 10000.00/ppn + 0.00 MB/sec
#              latency: 8000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#             register: unlimited, cost: 0 nsec
#           remote key: 112 bytes
#
#      Transport: cuda_ipc
#         Device: cuda
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 300000.00/ppn + 0.00 MB/sec
#              latency: 1 nsec
#             overhead: 0 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE_FILTER=*
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=y
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_VFS_ENABLE=y
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt/lib/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_HOOK_MODE=bistro
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_MEM_MODULE_UNLOAD_PREVENT_MODE=lazy
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_POSIX_ALLOC=md,mmap,heap
UCX_POSIX_FAILURE=DIAG
UCX_POSIX_MAX_NUM_EPS=inf
UCX_POSIX_BW=12179.00MBps
UCX_POSIX_FIFO_SIZE=64
UCX_POSIX_SEG_SIZE=8256
UCX_POSIX_FIFO_RELEASE_FACTOR=0.500
UCX_POSIX_RX_MAX_BUFS=-1
UCX_POSIX_RX_BUFS_GROW=512
UCX_POSIX_FIFO_HUGETLB=n
UCX_POSIX_FIFO_ELEM_SIZE=128
UCX_POSIX_FIFO_MAX_POLL=16
UCX_POSIX_ERROR_HANDLING=n
UCX_SYSV_HUGETLB_MODE=try
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SYSV_FAILURE=DIAG
UCX_SYSV_MAX_NUM_EPS=inf
UCX_SYSV_BW=12179.00MBps
UCX_SYSV_FIFO_SIZE=64
UCX_SYSV_SEG_SIZE=8256
UCX_SYSV_FIFO_RELEASE_FACTOR=0.500
UCX_SYSV_RX_MAX_BUFS=-1
UCX_SYSV_RX_BUFS_GROW=512
UCX_SYSV_FIFO_HUGETLB=n
UCX_SYSV_FIFO_ELEM_SIZE=128
UCX_SYSV_FIFO_MAX_POLL=16
UCX_SYSV_ERROR_HANDLING=n
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=DIAG
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_SELF_NUM_DEVICES=1
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=DIAG
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_PORT_RANGE=0
UCX_TCP_KEEPIDLE=10000000.00us
UCX_TCP_KEEPCNT=3
UCX_TCP_KEEPINTVL=1000000.00us
UCX_TCP_CM_FAILURE=DIAG
UCX_TCP_CM_REUSEADDR=n
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,tcp,sockcm
UCX_SOCKADDR_AUX_TLS=ud
UCX_SELECT_DISTANCE_MD=cuda_cpy
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=4.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_ADDRESS_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=512K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_CM_USE_ALL_DEVICES=y
UCX_LISTENER_BACKLOG=auto
UCX_PROTO_ENABLE=n
UCX_KEEPALIVE_INTERVAL=60000000.00us
UCX_KEEPALIVE_NUM_EPS=128
UCX_PROTO_INDIRECT_ID=auto
UCX_ERROR_HANDLER_DELAY=0.00us
UCX_CUDA_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_COPY_FAILURE=DIAG
UCX_CUDA_COPY_MAX_NUM_EPS=inf
UCX_CUDA_COPY_MAX_POLL=16
UCX_CUDA_COPY_MAX_EVENTS=inf
UCX_CUDA_IPC_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_IPC_FAILURE=DIAG
UCX_CUDA_IPC_MAX_NUM_EPS=inf
UCX_CUDA_IPC_MAX_POLL=16
UCX_CUDA_IPC_MAX_STREAMS=16
UCX_CUDA_IPC_CACHE=y
UCX_CUDA_IPC_MAX_EVENTS=inf
UCX_GDR_COPY_RCACHE=try
UCX_GDR_COPY_RCACHE_MEM_PRIO=1000
UCX_GDR_COPY_RCACHE_OVERHEAD=0.18us
UCX_GDR_COPY_RCACHE_ADDR_ALIGN=65536
UCX_GDR_COPY_RCACHE_MAX_REGIONS=inf
UCX_GDR_COPY_RCACHE_MAX_SIZE=inf
UCX_GDR_COPY_MEM_REG_OVERHEAD=16.00us
UCX_GDR_COPY_MEM_REG_GROWTH=0.00us
UCX_GDR_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_GDR_COPY_FAILURE=DIAG
UCX_GDR_COPY_MAX_NUM_EPS=inf
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_RCACHE_MAX_REGIONS=inf
UCX_IB_RCACHE_MAX_SIZE=inf
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=y
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=n
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq,dci
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=DIAG
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=4
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=auto
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_LOCAL_SUBNET=n
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_MAX_RD_ATOMIC=4
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_TX_POLL_ALWAYS=n
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_VERBS_FLUSH_MODE=auto
UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_RC_MLX5_FAILURE=DIAG
UCX_RC_MLX5_MAX_NUM_EPS=256
UCX_RC_MLX5_SEG_SIZE=8256
UCX_RC_MLX5_TX_QUEUE_LEN=256
UCX_RC_MLX5_TX_MAX_BATCH=16
UCX_RC_MLX5_TX_MAX_POLL=16
UCX_RC_MLX5_TX_MIN_INLINE=64
UCX_RC_MLX5_TX_INLINE_RESP=64
UCX_RC_MLX5_TX_MIN_SGE=4
UCX_RC_MLX5_TX_MAX_BUFS=-1
UCX_RC_MLX5_TX_BUFS_GROW=1024
UCX_RC_MLX5_RX_QUEUE_LEN=4095
UCX_RC_MLX5_RX_MAX_BATCH=16
UCX_RC_MLX5_RX_MAX_POLL=16
UCX_RC_MLX5_RX_INLINE=64
UCX_RC_MLX5_RX_MAX_BUFS=-1
UCX_RC_MLX5_RX_BUFS_GROW=0
UCX_RC_MLX5_ADDR_TYPE=auto
UCX_RC_MLX5_IS_GLOBAL=n
UCX_RC_MLX5_SL=auto
UCX_RC_MLX5_TRAFFIC_CLASS=auto
UCX_RC_MLX5_HOP_LIMIT=255
UCX_RC_MLX5_NUM_PATHS=auto
UCX_RC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_RC_MLX5_ROCE_PATH_FACTOR=1
UCX_RC_MLX5_LID_PATH_BITS=0
UCX_RC_MLX5_PKEY=auto
UCX_RC_MLX5_PATH_MTU=default
UCX_RC_MLX5_MAX_RD_ATOMIC=4
UCX_RC_MLX5_TIMEOUT=1000000.00us
UCX_RC_MLX5_RETRY_COUNT=7
UCX_RC_MLX5_RNR_TIMEOUT=1000.00us
UCX_RC_MLX5_RNR_RETRY_COUNT=7
UCX_RC_MLX5_FC_ENABLE=y
UCX_RC_MLX5_FC_WND_SIZE=512
UCX_RC_MLX5_FC_HARD_THRESH=0.250
UCX_RC_MLX5_FENCE=auto
UCX_RC_MLX5_MAX_GET_ZCOPY=auto
UCX_RC_MLX5_TX_NUM_GET_BYTES=inf
UCX_RC_MLX5_TX_POLL_ALWAYS=n
UCX_RC_MLX5_FC_SOFT_THRESH=0.500
UCX_RC_MLX5_TX_CQ_MODERATION=64
UCX_RC_MLX5_TX_CQ_LEN=4096
UCX_RC_MLX5_DM_SIZE=2K
UCX_RC_MLX5_DM_COUNT=1
UCX_RC_MLX5_MMIO_MODE=auto
UCX_RC_MLX5_AR_ENABLE=auto
UCX_RC_MLX5_TX_MAX_BB=inf
UCX_RC_MLX5_TM_ENABLE=n
UCX_RC_MLX5_TM_LIST_SIZE=1024
UCX_RC_MLX5_TM_SEG_SIZE=48K
UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_RC_MLX5_TM_MP_NUM_STRIDES=8
UCX_RC_MLX5_EXP_BACKOFF=0
UCX_RC_MLX5_SRQ_TOPO=cyclic,cyclic_emulated
UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_DC_MLX5_FAILURE=DIAG
UCX_DC_MLX5_MAX_NUM_EPS=inf
UCX_DC_MLX5_SEG_SIZE=8256
UCX_DC_MLX5_TX_QUEUE_LEN=128
UCX_DC_MLX5_TX_MAX_BATCH=16
UCX_DC_MLX5_TX_MAX_POLL=16
UCX_DC_MLX5_TX_MIN_INLINE=64
UCX_DC_MLX5_TX_INLINE_RESP=64
UCX_DC_MLX5_TX_MIN_SGE=4
UCX_DC_MLX5_TX_MAX_BUFS=-1
UCX_DC_MLX5_TX_BUFS_GROW=1024
UCX_DC_MLX5_RX_QUEUE_LEN=4095
UCX_DC_MLX5_RX_MAX_BATCH=16
UCX_DC_MLX5_RX_MAX_POLL=16
UCX_DC_MLX5_RX_INLINE=64
UCX_DC_MLX5_RX_MAX_BUFS=-1
UCX_DC_MLX5_RX_BUFS_GROW=0
UCX_DC_MLX5_ADDR_TYPE=auto
UCX_DC_MLX5_IS_GLOBAL=n
UCX_DC_MLX5_SL=auto
UCX_DC_MLX5_TRAFFIC_CLASS=auto
UCX_DC_MLX5_HOP_LIMIT=255
UCX_DC_MLX5_NUM_PATHS=auto
UCX_DC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_DC_MLX5_ROCE_PATH_FACTOR=1
UCX_DC_MLX5_LID_PATH_BITS=0
UCX_DC_MLX5_PKEY=auto
UCX_DC_MLX5_PATH_MTU=default
UCX_DC_MLX5_MAX_RD_ATOMIC=4
UCX_DC_MLX5_TIMEOUT=1000000.00us
UCX_DC_MLX5_RETRY_COUNT=7
UCX_DC_MLX5_RNR_TIMEOUT=1000.00us
UCX_DC_MLX5_RNR_RETRY_COUNT=7
UCX_DC_MLX5_FC_ENABLE=y
UCX_DC_MLX5_FC_WND_SIZE=512
UCX_DC_MLX5_FC_HARD_THRESH=0.250
UCX_DC_MLX5_FENCE=auto
UCX_DC_MLX5_MAX_GET_ZCOPY=auto
UCX_DC_MLX5_TX_NUM_GET_BYTES=inf
UCX_DC_MLX5_TX_POLL_ALWAYS=n
UCX_DC_MLX5_DM_SIZE=2K
UCX_DC_MLX5_DM_COUNT=1
UCX_DC_MLX5_MMIO_MODE=auto
UCX_DC_MLX5_AR_ENABLE=auto
UCX_DC_MLX5_TX_MAX_BB=inf
UCX_DC_MLX5_TM_ENABLE=n
UCX_DC_MLX5_TM_LIST_SIZE=1024
UCX_DC_MLX5_TM_SEG_SIZE=48K
UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_DC_MLX5_TM_MP_NUM_STRIDES=8
UCX_DC_MLX5_EXP_BACKOFF=0
UCX_DC_MLX5_SRQ_TOPO=list
UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128
UCX_DC_MLX5_NUM_DCI=8
UCX_DC_MLX5_TX_POLICY=dcs_quota
UCX_DC_MLX5_DCI_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCT_FULL_HANDSHAKE=n
UCX_DC_MLX5_RAND_DCI_SEED=0
UCX_DC_MLX5_QUOTA=32
UCX_DC_MLX5_FC_HARD_REQ_TIMEOUT=5000000.00us
UCX_DC_MLX5_COMPACT_AV=y
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=DIAG
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=4
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=auto
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_LOCAL_SUBNET=n
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_MIN_POKE_TIME=250000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_UD_MLX5_FAILURE=DIAG
UCX_UD_MLX5_MAX_NUM_EPS=inf
UCX_UD_MLX5_SEG_SIZE=8K
UCX_UD_MLX5_TX_QUEUE_LEN=256
UCX_UD_MLX5_TX_MAX_BATCH=16
UCX_UD_MLX5_TX_MAX_POLL=16
UCX_UD_MLX5_TX_MIN_INLINE=64
UCX_UD_MLX5_TX_INLINE_RESP=0
UCX_UD_MLX5_TX_MIN_SGE=4
UCX_UD_MLX5_TX_MAX_BUFS=-1
UCX_UD_MLX5_TX_BUFS_GROW=1024
UCX_UD_MLX5_RX_QUEUE_LEN=4096
UCX_UD_MLX5_RX_MAX_BATCH=16
UCX_UD_MLX5_RX_MAX_POLL=16
UCX_UD_MLX5_RX_INLINE=0
UCX_UD_MLX5_RX_MAX_BUFS=-1
UCX_UD_MLX5_RX_BUFS_GROW=0
UCX_UD_MLX5_ADDR_TYPE=auto
UCX_UD_MLX5_IS_GLOBAL=n
UCX_UD_MLX5_SL=auto
UCX_UD_MLX5_TRAFFIC_CLASS=auto
UCX_UD_MLX5_HOP_LIMIT=255
UCX_UD_MLX5_NUM_PATHS=auto
UCX_UD_MLX5_ROCE_LOCAL_SUBNET=n
UCX_UD_MLX5_ROCE_PATH_FACTOR=1
UCX_UD_MLX5_LID_PATH_BITS=0
UCX_UD_MLX5_PKEY=auto
UCX_UD_MLX5_PATH_MTU=default
UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128
UCX_UD_MLX5_TIMEOUT=300000000.00us
UCX_UD_MLX5_TIMER_TICK=10000.00us
UCX_UD_MLX5_TIMER_BACKOFF=2.000
UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us
UCX_UD_MLX5_MIN_POKE_TIME=250000.00us
UCX_UD_MLX5_ETH_DGID_CHECK=y
UCX_UD_MLX5_MAX_WINDOW=1025
UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_DM_SIZE=2K
UCX_UD_MLX5_DM_COUNT=1
UCX_UD_MLX5_MMIO_MODE=auto
UCX_UD_MLX5_AR_ENABLE=auto
UCX_UD_MLX5_COMPACT_AV=y
UCX_RDMA_CM_FAILURE=DIAG
UCX_RDMA_CM_REUSEADDR=n
UCX_RDMA_CM_SOURCE_ADDRESS=
UCX_RDMA_CM_TIMEOUT=10000000.00us
UCX_RDMA_CM_RESERVED_QPN=try
UCX_CMA_ALLOC=huge,thp,mmap,heap
UCX_CMA_FAILURE=DIAG
UCX_CMA_MAX_NUM_EPS=inf
UCX_CMA_BW=11145.00MBps
UCX_CMA_MAX_IOV=16
UCX_CMA_SEG_SIZE=512K
UCX_CMA_TX_QUOTA=1
UCX_CMA_TX_MAX_BUFS=-1
UCX_CMA_TX_BUFS_GROW=8
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY     1
#define ENABLE_DEBUG_DATA         0
#define ENABLE_MT                 1
#define ENABLE_PARAMS_CHECK       0
#define HAVE_1_ARG_BFD_SECTION_SIZE 0
#define HAVE_ALLOCA               1
#define HAVE_ALLOCA_H             1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV             1
#define HAVE_CPU_SET_T            1
#define HAVE_CUDA                 1
#define HAVE_CUDA_H               1
#define HAVE_CUDA_RUNTIME_H       1
#define HAVE_DC_DV                1
#define HAVE_DECL_ASPRINTF        1
#define HAVE_DECL_BASENAME        1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 0
#define HAVE_DECL_BFD_SECTION_VMA 0
#define HAVE_DECL_CPU_ISSET       1
#define HAVE_DECL_CPU_ZERO        1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN        1
#define HAVE_DECL_FUSE_MOUNT      0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT    0
#define HAVE_DECL_F_SETOWN_EX     1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 1
#define HAVE_DECL_IBV_ADVISE_MR   1
#define HAVE_DECL_IBV_ALLOC_DM    1
#define HAVE_DECL_IBV_ALLOC_TD    1
#define HAVE_DECL_IBV_CMD_MODIFY_QP 0
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ  1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 0
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 0
#define HAVE_DECL_IBV_EXP_ALLOC_DM 0
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 0
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 0
#define HAVE_DECL_IBV_EXP_CREATE_QP 0
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 0
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 0
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 0
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 0
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 0
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 0
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_EXP_POST_SEND 0
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 0
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 0
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 0
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 0
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 0
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 0
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 0
#define HAVE_DECL_IBV_EXP_REG_MR  0
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 0
#define HAVE_DECL_IBV_EXP_SETENV  0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 0
#define HAVE_DECL_IBV_EXP_WR_NOP  0
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 1
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID   1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT    1
#define HAVE_DECL_IN_ATTRIB       1
#define HAVE_DECL_IPPROTO_TCP     1
#define HAVE_DECL_MADV_FREE       1
#define HAVE_DECL_MADV_REMOVE     1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 1
#define HAVE_DECL_MLX5DV_CREATE_QP 1
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 1
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 1
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 1
#define HAVE_DECL_MLX5DV_OBJ_AH   1
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_BF 0
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_NC 0
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER  1
#define HAVE_DECL_SOL_SOCKET      1
#define HAVE_DECL_SO_KEEPALIVE    1
#define HAVE_DECL_SPEED_UNKNOWN   1
#define HAVE_DECL_STRERROR_R      1
#define HAVE_DECL_SYS_BRK         1
#define HAVE_DECL_SYS_IPC         0
#define HAVE_DECL_SYS_MADVISE     1
#define HAVE_DECL_SYS_MMAP        1
#define HAVE_DECL_SYS_MREMAP      1
#define HAVE_DECL_SYS_MUNMAP      1
#define HAVE_DECL_SYS_SHMAT       1
#define HAVE_DECL_SYS_SHMDT       1
#define HAVE_DECL_TCP_KEEPCNT     1
#define HAVE_DECL_TCP_KEEPIDLE    1
#define HAVE_DECL_TCP_KEEPINTVL   1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DEVX                 1
#define HAVE_DLFCN_H              1
#define HAVE_GDRAPI_H             1
#define HAVE_HW_TIMER             1
#define HAVE_IB                   1
#define HAVE_IBV_DM               1
#define HAVE_IN6_ADDR_S6_ADDR32   1
#define HAVE_INFINIBAND_MLX5DV_H  1
#define HAVE_INFINIBAND_TM_TYPES_H 1
#define HAVE_INOTIFY              1
#define HAVE_INTTYPES_H           1
#define HAVE_IP_IP_DST            1
#define HAVE_LIBGEN_H             1
#define HAVE_LIBRT                1
#define HAVE_LINUX_FUTEX_H        1
#define HAVE_LINUX_IP_H           1
#define HAVE_LINUX_MMAN_H         1
#define HAVE_MALLOC_H             1
#define HAVE_MALLOC_HOOK          1
#define HAVE_MALLOC_TRIM          1
#define HAVE_MEMALIGN             1
#define HAVE_MEMORY_H             1
#define HAVE_MLX5_HW              1
#define HAVE_MLX5_HW_UD           1
#define HAVE_MREMAP               1
#define HAVE_NETINET_IP_H         1
#define HAVE_NET_ETHERNET_H       1
#define HAVE_NUMA                 1
#define HAVE_NUMAIF_H             1
#define HAVE_NUMA_H               1
#define HAVE_ODP                  1
#define HAVE_ODP_IMPLICIT         1
#define HAVE_POSIX_MEMALIGN       1
#define HAVE_PREFETCH             1
#define HAVE_SCHED_GETAFFINITY    1
#define HAVE_SCHED_SETAFFINITY    1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T         1
#define HAVE_STDINT_H             1
#define HAVE_STDLIB_H             1
#define HAVE_STRERROR_R           1
#define HAVE_STRINGS_H            1
#define HAVE_STRING_H             1
#define HAVE_STRUCT_BITMASK       1
#define HAVE_STRUCT_DL_PHDR_INFO  1
#define HAVE_STRUCT_IBV_DEVICE_ATTR_EX_PCI_ATOMIC_CAPS 1
#define HAVE_STRUCT_IBV_TM_CAPS_FLAGS 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_SYS_EPOLL_H          1
#define HAVE_SYS_EVENTFD_H        1
#define HAVE_SYS_STAT_H           1
#define HAVE_SYS_TYPES_H          1
#define HAVE_SYS_UIO_H            1
#define HAVE_TL_DC                1
#define HAVE_TL_RC                1
#define HAVE_TL_UD                1
#define HAVE_UCM_PTMALLOC286      1
#define HAVE_UNISTD_H             1
#define HAVE___CLEAR_CACHE        1
#define HAVE___CURBRK             1
#define HAVE___SIGHANDLER_T       1
#define IBV_HW_TM                 1
#define LT_OBJDIR                 ".libs/"
#define NVALGRIND                 1
#define PACKAGE                   "ucx"
#define PACKAGE_BUGREPORT         ""
#define PACKAGE_NAME              "ucx"
#define PACKAGE_STRING            "ucx 1.11"
#define PACKAGE_TARNAME           "ucx"
#define PACKAGE_URL               ""
#define PACKAGE_VERSION           "1.11"
#define STDC_HEADERS              1
#define STRERROR_R_CHAR_P         1
#define UCM_BISTRO_HOOKS          1
#define UCS_MAX_LOG_LEVEL         UCS_LOG_LEVEL_DEBUG
#define UCT_TCP_EP_KEEPALIVE      1
#define UCT_UD_EP_DEBUG_HOOKS     0
#define UCX_CONFIGURE_FLAGS       "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt"
#define UCX_MODULE_SUBDIR         "ucx"
#define VERSION                   "1.11"
#define restrict                  __restrict
#define test_MODULES              ":module"
#define ucg_MODULES               ":builtin"
#define ucm_MODULES               ":cuda"
#define ucs_MODULES               ""
#define uct_MODULES               ":cuda:ib:rdmacm:cma:xpmem"
#define uct_cuda_MODULES          ":gdrcopy"
#define uct_ib_MODULES            ""
#define uct_rocm_MODULES          ""
#define ucx_perftest_MODULES      ":cuda"
#
# Memory domain: posix
#     Component: posix
#             allocate: unlimited
#           remote key: 24 bytes
#           rkey_ptr is supported
#
#      Transport: posix
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: sysv
#     Component: sysv
#             allocate: unlimited
#           remote key: 12 bytes
#           rkey_ptr is supported
#
#      Transport: sysv
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 12179.00 MB/sec
#              latency: 80 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 100
#             am_bcopy: <= 8256
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: self
#     Component: self
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: self
#         Device: memory0
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 6911.00 MB/sec
#              latency: 0 nsec
#             overhead: 10 nsec
#            put_short: <= 4294967295
#            put_bcopy: unlimited
#            get_bcopy: unlimited
#             am_short: <= 8K
#             am_bcopy: <= 8K
#               domain: cpu
#           atomic_add: 32, 64 bit
#           atomic_and: 32, 64 bit
#            atomic_or: 32, 64 bit
#           atomic_xor: 32, 64 bit
#          atomic_fadd: 32, 64 bit
#          atomic_fand: 32, 64 bit
#           atomic_for: 32, 64 bit
#          atomic_fxor: 32, 64 bit
#          atomic_swap: 32, 64 bit
#         atomic_cswap: 32, 64 bit
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: ep_check
#
#
# Memory domain: tcp
#     Component: tcp
#             register: unlimited, cost: 0 nsec
#           remote key: 0 bytes
#
#      Transport: tcp
#         Device: eth4
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11316.36/ppn + 0.00 MB/sec
#              latency: 5206 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 0
#     device num paths: 1
#              max eps: 256
#       device address: 6 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#      Transport: tcp
#         Device: lo
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 11.91/ppn + 0.00 MB/sec
#              latency: 10960 nsec
#             overhead: 50000 nsec
#            put_zcopy: <= 18446744073709551590, up to 6 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 0
#             am_short: <= 8K
#             am_bcopy: <= 8K
#             am_zcopy: <= 64K, up to 6 iov
#   am_opt_zcopy_align: <= 1
#         am_align_mtu: <= 0
#            am header: <= 8037
#           connection: to ep, to iface
#      device priority: 1
#     device num paths: 1
#              max eps: 256
#       device address: 18 bytes
#        iface address: 2 bytes
#           ep address: 10 bytes
#       error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
#      max_conn_priv: 2064 bytes
#
# Memory domain: cuda_cpy
#     Component: cuda_cpy
#             allocate: unlimited
#             register: unlimited, cost: 0 nsec
#
#      Transport: cuda_copy
#         Device: cuda
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 10000.00/ppn + 0.00 MB/sec
#              latency: 8000 nsec
#             overhead: 0 nsec
#            put_short: <= 4294967295
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_short: <= 4294967295
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 0 bytes
#        iface address: 8 bytes
#       error handling: none
#
#
# Memory domain: cuda_ipc
#     Component: cuda_ipc
#             register: unlimited, cost: 0 nsec
#           remote key: 112 bytes
#
#      Transport: cuda_ipc
#         Device: cuda
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 300000.00/ppn + 0.00 MB/sec
#              latency: 1 nsec
#             overhead: 0 nsec
#            put_zcopy: unlimited, up to 1 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 1 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
#     Component: cma
#             register: unlimited, cost: 9 nsec
#
#      Transport: cma
#         Device: memory
#  System device: <unknown>
#
#      capabilities:
#            bandwidth: 0.00/ppn + 11145.00 MB/sec
#              latency: 80 nsec
#             overhead: 400 nsec
#            put_zcopy: unlimited, up to 16 iov
#  put_opt_zcopy_align: <= 1
#        put_align_mtu: <= 1
#            get_zcopy: unlimited, up to 16 iov
#  get_opt_zcopy_align: <= 1
#        get_align_mtu: <= 1
#           connection: to iface
#      device priority: 0
#     device num paths: 1
#              max eps: inf
#       device address: 8 bytes
#        iface address: 4 bytes
#       error handling: peer failure, ep_check
#
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE_FILTER=*
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=y
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_VFS_ENABLE=y
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt/lib/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_HOOK_MODE=bistro
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_MEM_MODULE_UNLOAD_PREVENT_MODE=lazy
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_POSIX_ALLOC=md,mmap,heap
UCX_POSIX_FAILURE=DIAG
UCX_POSIX_MAX_NUM_EPS=inf
UCX_POSIX_BW=12179.00MBps
UCX_POSIX_FIFO_SIZE=64
UCX_POSIX_SEG_SIZE=8256
UCX_POSIX_FIFO_RELEASE_FACTOR=0.500
UCX_POSIX_RX_MAX_BUFS=-1
UCX_POSIX_RX_BUFS_GROW=512
UCX_POSIX_FIFO_HUGETLB=n
UCX_POSIX_FIFO_ELEM_SIZE=128
UCX_POSIX_FIFO_MAX_POLL=16
UCX_POSIX_ERROR_HANDLING=n
UCX_SYSV_HUGETLB_MODE=try
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SYSV_FAILURE=DIAG
UCX_SYSV_MAX_NUM_EPS=inf
UCX_SYSV_BW=12179.00MBps
UCX_SYSV_FIFO_SIZE=64
UCX_SYSV_SEG_SIZE=8256
UCX_SYSV_FIFO_RELEASE_FACTOR=0.500
UCX_SYSV_RX_MAX_BUFS=-1
UCX_SYSV_RX_BUFS_GROW=512
UCX_SYSV_FIFO_HUGETLB=n
UCX_SYSV_FIFO_ELEM_SIZE=128
UCX_SYSV_FIFO_MAX_POLL=16
UCX_SYSV_ERROR_HANDLING=n
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=DIAG
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_SELF_NUM_DEVICES=1
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=DIAG
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_PORT_RANGE=0
UCX_TCP_KEEPIDLE=10000000.00us
UCX_TCP_KEEPCNT=3
UCX_TCP_KEEPINTVL=1000000.00us
UCX_TCP_CM_FAILURE=DIAG
UCX_TCP_CM_REUSEADDR=n
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,tcp,sockcm
UCX_SOCKADDR_AUX_TLS=ud
UCX_SELECT_DISTANCE_MD=cuda_cpy
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=4.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_ADDRESS_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=512K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_CM_USE_ALL_DEVICES=y
UCX_LISTENER_BACKLOG=auto
UCX_PROTO_ENABLE=n
UCX_KEEPALIVE_INTERVAL=60000000.00us
UCX_KEEPALIVE_NUM_EPS=128
UCX_PROTO_INDIRECT_ID=auto
UCX_ERROR_HANDLER_DELAY=0.00us
UCX_CUDA_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_COPY_FAILURE=DIAG
UCX_CUDA_COPY_MAX_NUM_EPS=inf
UCX_CUDA_COPY_MAX_POLL=16
UCX_CUDA_COPY_MAX_EVENTS=inf
UCX_CUDA_IPC_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_IPC_FAILURE=DIAG
UCX_CUDA_IPC_MAX_NUM_EPS=inf
UCX_CUDA_IPC_MAX_POLL=16
UCX_CUDA_IPC_MAX_STREAMS=16
UCX_CUDA_IPC_CACHE=y
UCX_CUDA_IPC_MAX_EVENTS=inf
UCX_GDR_COPY_RCACHE=try
UCX_GDR_COPY_RCACHE_MEM_PRIO=1000
UCX_GDR_COPY_RCACHE_OVERHEAD=0.18us
UCX_GDR_COPY_RCACHE_ADDR_ALIGN=65536
UCX_GDR_COPY_RCACHE_MAX_REGIONS=inf
UCX_GDR_COPY_RCACHE_MAX_SIZE=inf
UCX_GDR_COPY_MEM_REG_OVERHEAD=16.00us
UCX_GDR_COPY_MEM_REG_GROWTH=0.00us
UCX_GDR_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_GDR_COPY_FAILURE=DIAG
UCX_GDR_COPY_MAX_NUM_EPS=inf
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_RCACHE_MAX_REGIONS=inf
UCX_IB_RCACHE_MAX_SIZE=inf
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=y
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=n
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq,dci
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=DIAG
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=4
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=auto
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_LOCAL_SUBNET=n
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_MAX_RD_ATOMIC=4
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_TX_POLL_ALWAYS=n
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_VERBS_FLUSH_MODE=auto
UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_RC_MLX5_FAILURE=DIAG
UCX_RC_MLX5_MAX_NUM_EPS=256
UCX_RC_MLX5_SEG_SIZE=8256
UCX_RC_MLX5_TX_QUEUE_LEN=256
UCX_RC_MLX5_TX_MAX_BATCH=16
UCX_RC_MLX5_TX_MAX_POLL=16
UCX_RC_MLX5_TX_MIN_INLINE=64
UCX_RC_MLX5_TX_INLINE_RESP=64
UCX_RC_MLX5_TX_MIN_SGE=4
UCX_RC_MLX5_TX_MAX_BUFS=-1
UCX_RC_MLX5_TX_BUFS_GROW=1024
UCX_RC_MLX5_RX_QUEUE_LEN=4095
UCX_RC_MLX5_RX_MAX_BATCH=16
UCX_RC_MLX5_RX_MAX_POLL=16
UCX_RC_MLX5_RX_INLINE=64
UCX_RC_MLX5_RX_MAX_BUFS=-1
UCX_RC_MLX5_RX_BUFS_GROW=0
UCX_RC_MLX5_ADDR_TYPE=auto
UCX_RC_MLX5_IS_GLOBAL=n
UCX_RC_MLX5_SL=auto
UCX_RC_MLX5_TRAFFIC_CLASS=auto
UCX_RC_MLX5_HOP_LIMIT=255
UCX_RC_MLX5_NUM_PATHS=auto
UCX_RC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_RC_MLX5_ROCE_PATH_FACTOR=1
UCX_RC_MLX5_LID_PATH_BITS=0
UCX_RC_MLX5_PKEY=auto
UCX_RC_MLX5_PATH_MTU=default
UCX_RC_MLX5_MAX_RD_ATOMIC=4
UCX_RC_MLX5_TIMEOUT=1000000.00us
UCX_RC_MLX5_RETRY_COUNT=7
UCX_RC_MLX5_RNR_TIMEOUT=1000.00us
UCX_RC_MLX5_RNR_RETRY_COUNT=7
UCX_RC_MLX5_FC_ENABLE=y
UCX_RC_MLX5_FC_WND_SIZE=512
UCX_RC_MLX5_FC_HARD_THRESH=0.250
UCX_RC_MLX5_FENCE=auto
UCX_RC_MLX5_MAX_GET_ZCOPY=auto
UCX_RC_MLX5_TX_NUM_GET_BYTES=inf
UCX_RC_MLX5_TX_POLL_ALWAYS=n
UCX_RC_MLX5_FC_SOFT_THRESH=0.500
UCX_RC_MLX5_TX_CQ_MODERATION=64
UCX_RC_MLX5_TX_CQ_LEN=4096
UCX_RC_MLX5_DM_SIZE=2K
UCX_RC_MLX5_DM_COUNT=1
UCX_RC_MLX5_MMIO_MODE=auto
UCX_RC_MLX5_AR_ENABLE=auto
UCX_RC_MLX5_TX_MAX_BB=inf
UCX_RC_MLX5_TM_ENABLE=n
UCX_RC_MLX5_TM_LIST_SIZE=1024
UCX_RC_MLX5_TM_SEG_SIZE=48K
UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_RC_MLX5_TM_MP_NUM_STRIDES=8
UCX_RC_MLX5_EXP_BACKOFF=0
UCX_RC_MLX5_SRQ_TOPO=cyclic,cyclic_emulated
UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_DC_MLX5_FAILURE=DIAG
UCX_DC_MLX5_MAX_NUM_EPS=inf
UCX_DC_MLX5_SEG_SIZE=8256
UCX_DC_MLX5_TX_QUEUE_LEN=128
UCX_DC_MLX5_TX_MAX_BATCH=16
UCX_DC_MLX5_TX_MAX_POLL=16
UCX_DC_MLX5_TX_MIN_INLINE=64
UCX_DC_MLX5_TX_INLINE_RESP=64
UCX_DC_MLX5_TX_MIN_SGE=4
UCX_DC_MLX5_TX_MAX_BUFS=-1
UCX_DC_MLX5_TX_BUFS_GROW=1024
UCX_DC_MLX5_RX_QUEUE_LEN=4095
UCX_DC_MLX5_RX_MAX_BATCH=16
UCX_DC_MLX5_RX_MAX_POLL=16
UCX_DC_MLX5_RX_INLINE=64
UCX_DC_MLX5_RX_MAX_BUFS=-1
UCX_DC_MLX5_RX_BUFS_GROW=0
UCX_DC_MLX5_ADDR_TYPE=auto
UCX_DC_MLX5_IS_GLOBAL=n
UCX_DC_MLX5_SL=auto
UCX_DC_MLX5_TRAFFIC_CLASS=auto
UCX_DC_MLX5_HOP_LIMIT=255
UCX_DC_MLX5_NUM_PATHS=auto
UCX_DC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_DC_MLX5_ROCE_PATH_FACTOR=1
UCX_DC_MLX5_LID_PATH_BITS=0
UCX_DC_MLX5_PKEY=auto
UCX_DC_MLX5_PATH_MTU=default
UCX_DC_MLX5_MAX_RD_ATOMIC=4
UCX_DC_MLX5_TIMEOUT=1000000.00us
UCX_DC_MLX5_RETRY_COUNT=7
UCX_DC_MLX5_RNR_TIMEOUT=1000.00us
UCX_DC_MLX5_RNR_RETRY_COUNT=7
UCX_DC_MLX5_FC_ENABLE=y
UCX_DC_MLX5_FC_WND_SIZE=512
UCX_DC_MLX5_FC_HARD_THRESH=0.250
UCX_DC_MLX5_FENCE=auto
UCX_DC_MLX5_MAX_GET_ZCOPY=auto
UCX_DC_MLX5_TX_NUM_GET_BYTES=inf
UCX_DC_MLX5_TX_POLL_ALWAYS=n
UCX_DC_MLX5_DM_SIZE=2K
UCX_DC_MLX5_DM_COUNT=1
UCX_DC_MLX5_MMIO_MODE=auto
UCX_DC_MLX5_AR_ENABLE=auto
UCX_DC_MLX5_TX_MAX_BB=inf
UCX_DC_MLX5_TM_ENABLE=n
UCX_DC_MLX5_TM_LIST_SIZE=1024
UCX_DC_MLX5_TM_SEG_SIZE=48K
UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_DC_MLX5_TM_MP_NUM_STRIDES=8
UCX_DC_MLX5_EXP_BACKOFF=0
UCX_DC_MLX5_SRQ_TOPO=list
UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128
UCX_DC_MLX5_NUM_DCI=8
UCX_DC_MLX5_TX_POLICY=dcs_quota
UCX_DC_MLX5_DCI_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCT_FULL_HANDSHAKE=n
UCX_DC_MLX5_RAND_DCI_SEED=0
UCX_DC_MLX5_QUOTA=32
UCX_DC_MLX5_FC_HARD_REQ_TIMEOUT=5000000.00us
UCX_DC_MLX5_COMPACT_AV=y
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=DIAG
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=4
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=auto
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_LOCAL_SUBNET=n
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_MIN_POKE_TIME=250000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_UD_MLX5_FAILURE=DIAG
UCX_UD_MLX5_MAX_NUM_EPS=inf
UCX_UD_MLX5_SEG_SIZE=8K
UCX_UD_MLX5_TX_QUEUE_LEN=256
UCX_UD_MLX5_TX_MAX_BATCH=16
UCX_UD_MLX5_TX_MAX_POLL=16
UCX_UD_MLX5_TX_MIN_INLINE=64
UCX_UD_MLX5_TX_INLINE_RESP=0
UCX_UD_MLX5_TX_MIN_SGE=4
UCX_UD_MLX5_TX_MAX_BUFS=-1
UCX_UD_MLX5_TX_BUFS_GROW=1024
UCX_UD_MLX5_RX_QUEUE_LEN=4096
UCX_UD_MLX5_RX_MAX_BATCH=16
UCX_UD_MLX5_RX_MAX_POLL=16
UCX_UD_MLX5_RX_INLINE=0
UCX_UD_MLX5_RX_MAX_BUFS=-1
UCX_UD_MLX5_RX_BUFS_GROW=0
UCX_UD_MLX5_ADDR_TYPE=auto
UCX_UD_MLX5_IS_GLOBAL=n
UCX_UD_MLX5_SL=auto
UCX_UD_MLX5_TRAFFIC_CLASS=auto
UCX_UD_MLX5_HOP_LIMIT=255
UCX_UD_MLX5_NUM_PATHS=auto
UCX_UD_MLX5_ROCE_LOCAL_SUBNET=n
UCX_UD_MLX5_ROCE_PATH_FACTOR=1
UCX_UD_MLX5_LID_PATH_BITS=0
UCX_UD_MLX5_PKEY=auto
UCX_UD_MLX5_PATH_MTU=default
UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128
UCX_UD_MLX5_TIMEOUT=300000000.00us
UCX_UD_MLX5_TIMER_TICK=10000.00us
UCX_UD_MLX5_TIMER_BACKOFF=2.000
UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us
UCX_UD_MLX5_MIN_POKE_TIME=250000.00us
UCX_UD_MLX5_ETH_DGID_CHECK=y
UCX_UD_MLX5_MAX_WINDOW=1025
UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_DM_SIZE=2K
UCX_UD_MLX5_DM_COUNT=1
UCX_UD_MLX5_MMIO_MODE=auto
UCX_UD_MLX5_AR_ENABLE=auto
UCX_UD_MLX5_COMPACT_AV=y
UCX_RDMA_CM_FAILURE=DIAG
UCX_RDMA_CM_REUSEADDR=n
UCX_RDMA_CM_SOURCE_ADDRESS=
UCX_RDMA_CM_TIMEOUT=10000000.00us
UCX_RDMA_CM_RESERVED_QPN=try
UCX_CMA_ALLOC=huge,thp,mmap,heap
UCX_CMA_FAILURE=DIAG
UCX_CMA_MAX_NUM_EPS=inf
UCX_CMA_BW=11145.00MBps
UCX_CMA_MAX_IOV=16
UCX_CMA_SEG_SIZE=512K
UCX_CMA_TX_QUOTA=1
UCX_CMA_TX_MAX_BUFS=-1
UCX_CMA_TX_BUFS_GROW=8
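
A dump of this size is easier to check by filtering than by reading. The sketch below assumes the `ucx_info` output has been saved to a text file or is piped in on stdin. The function names `list_transports` and `devx_settings` are illustrative and not part of UCX.

```shell
#!/bin/sh
# Print the unique transport names found in saved `ucx_info -d` output.
# Lines of interest look like: "#      Transport: tcp"
list_transports() {
    sed -n 's/^#[[:space:]]*Transport:[[:space:]]*//p' "$@" | sort -u
}

# Print the DEVX-related variables from saved `ucx_info -c` output.
# Matches UCX_IB_MLX5_DEVX and UCX_IB_MLX5_DEVX_OBJECTS.
devx_settings() {
    grep -E '^UCX_IB_MLX5_DEVX' "$@"
}
```

For the dump that begins at the `# UCT version=1.11.0` line above, `list_transports` would report cma, cuda_copy, cuda_ipc, posix, self, sysv and tcp, with no rc, dc or ud entries, and `devx_settings` would show `UCX_IB_MLX5_DEVX=n`.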
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ofed_info

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: execve(): ofed_info: No such file or directory
srun: error: node001: task 0: Exited with exit code 2
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: execve(): ofed_info: No such file or directory
srun: error: node002: task 1: Exited with exit code 2
[root@bright88 mxnet]#
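
The `execve(): ofed_info: No such file or directory` error most likely means the binary is not on the `PATH` inside the container image. Checking which diagnostic tools the image actually ships can save a round trip. A minimal sketch, where the function name `missing_tools` is hypothetical:

```shell
#!/bin/sh
# Print each named command that is NOT found on the current PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Run from a shell inside the container, `missing_tools ofed_info ucx_info ibv_devinfo ibstat lspci gdb` prints the names that would fail in the same way.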
Artemy-Mellanox commented 1 year ago

@karanveersingh5623 I need a bit more info. Could you please run these commands and attach the output?


srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area gdb -ex r -ex "info sharedlibrary" -ex q --args "$(which ucx_info)" -c
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area lspci
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibstat
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ls -la /sys/class/infiniband
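
One caveat about the first command: `"$(which ucx_info)"` sits in double quotes, so the shell on the submitting host expands it before `srun` starts, using the host's `PATH` rather than the container's. Single quotes combined with an inner `sh -c` defer the lookup to the shell that runs inside the container. A sketch of the idea, where `inner_cmd` is a hypothetical variable name and `<srun options>` stands for the options used above:

```shell
#!/bin/sh
# Single quotes keep $(...) literal, so nothing is looked up on the host.
inner_cmd='gdb -ex r -ex "info sharedlibrary" -ex q --args "$(command -v ucx_info)" -c'

# Show the text that the inner shell would receive and evaluate.
printf '%s\n' "$inner_cmd"

# Intended use (not executed here):
#   srun <srun options> sh -c "$inner_cmd"
```

This only changes where `ucx_info` is resolved. It still requires `gdb` and `ucx_info` to be present in the image.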
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area gdb -ex r -ex "info sharedlibrary" -ex q --args "$(which ucx_info)" -c
/usr/bin/which: no ucx_info in (/cm/shared/apps/slurm/current/sbin:/cm/shared/apps/slurm/current/bin:/cm/local/apps/cm-setup/bin:/cm/local/apps/cluster-tools/bin:/cm/local/apps/cmd/sbin:/cm/local/apps/cmd/bin:/cm/local/apps/environment-modules/4.5.3//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/cm/local/apps/environment-modules/4.5.3/bin:/bin:/sbin:/opt/dell/srvadmin/bin:/opt/dell/srvadmin/sbin:/root/bin)
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: execve(): gdb: No such file or directory
srun: error: node001: task 0: Exited with exit code 2
[root@bright88 mxnet]#
[root@bright88 mxnet]#
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
No IB devices found
srun: error: node001: task 0: Exited with exit code 255
[root@bright88 mxnet]#
[root@bright88 mxnet]#
[root@bright88 mxnet]#
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibstat
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.1014
        Hardware version: 0
        Node GUID: 0x043f720300dc0684
        System image GUID: 0x043f720300dc0684
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x063f72fffedc0684
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.1014
        Hardware version: 0
        Node GUID: 0x043f720300dc0685
        System image GUID: 0x043f720300dc0684
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x063f72fffedc0685
                Link layer: Ethernet
[root@bright88 mxnet]#
[root@bright88 mxnet]#
[root@bright88 mxnet]#
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ls -la /sys/class/infiniband
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
total 0
drwxr-xr-x  2 root root 0 Nov 15 18:30 .
drwxr-xr-x 91 root root 0 Nov 15 12:34 ..
lrwxrwxrwx  1 root root 0 Nov 15 18:30 mlx5_0 -> ../../devices/pci0000:97/0000:97:02.0/0000:98:00.0/infiniband/mlx5_0
lrwxrwxrwx  1 root root 0 Nov 15 18:30 mlx5_1 -> ../../devices/pci0000:97/0000:97:02.0/0000:98:00.1/infiniband/mlx5_1
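
The outputs above disagree: `ibstat` and sysfs list `mlx5_0` and `mlx5_1`, yet `ibv_devinfo` reports no devices. `ibv_devinfo` goes through libibverbs, which typically needs the `/dev/infiniband/uverbsN` character devices (and a matching provider library) to be visible inside the container. One possible next check is whether those nodes exist there. This is a minimal sketch, not a confirmed diagnosis; the function name and its two directory arguments are illustrative:

```bash
#!/bin/sh
# List the RDMA devices known to sysfs, then the uverbs nodes under /dev.
# Returns non-zero when no uverbs node is found.
# Both directories are parameters so the check can be pointed elsewhere.
check_uverbs() {
  sysfs_dir=${1:-/sys/class/infiniband}
  dev_dir=${2:-/dev/infiniband}

  for dev in "$sysfs_dir"/*; do
    [ -e "$dev" ] || continue
    echo "sysfs device: $(basename "$dev")"
  done

  found=0
  for node in "$dev_dir"/uverbs*; do
    [ -e "$node" ] || continue
    echo "uverbs node: $node"
    found=1
  done

  if [ "$found" -eq 0 ]; then
    echo "no uverbs nodes under $dev_dir"
    return 1
  fi
}
```

To be meaningful, the function has to run inside the container, for example from a script on one of the mounted paths and started with the same `srun` options as above.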
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area lspci
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
00:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
00:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
00:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
00:00.4 Host bridge: Intel Corporation Device 0998
00:02.0 System peripheral: Intel Corporation Device 09a6
00:02.1 System peripheral: Intel Corporation Device 09a7
00:02.4 Non-Essential Instrumentation [1300]: Intel Corporation Device 3456 (rev 01)
00:11.0 Unassigned class [ff00]: Intel Corporation C620 Series Chipset Family MROM 0 (rev 0a)
00:11.5 SATA controller: Intel Corporation C620 Series Chipset Family SSATA Controller [AHCI mode] (rev 0a)
00:14.0 USB controller: Intel Corporation C620 Series Chipset Family USB 3.0 xHCI Controller (rev 0a)
00:14.2 Signal processing controller: Intel Corporation C620 Series Chipset Family Thermal Subsystem (rev 0a)
00:16.0 Communication controller: Intel Corporation C620 Series Chipset Family MEI Controller #1 (rev 0a)
00:16.1 Communication controller: Intel Corporation C620 Series Chipset Family MEI Controller #2 (rev 0a)
00:16.4 Communication controller: Intel Corporation C620 Series Chipset Family MEI Controller #3 (rev 0a)
00:17.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA Controller [AHCI mode] (rev 0a)
00:1c.0 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #1 (rev fa)
00:1c.4 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #5 (rev fa)
00:1c.5 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #6 (rev fa)
00:1d.0 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #9 (rev fa)
00:1f.0 ISA bridge: Intel Corporation Device a1cb (rev 0a)
00:1f.2 Memory controller: Intel Corporation C620 Series Chipset Family Power Management Controller (rev 0a)
00:1f.4 SMBus: Intel Corporation C620 Series Chipset Family SMBus (rev 0a)
00:1f.5 Serial bus controller [0c80]: Intel Corporation C620 Series Chipset Family SPI Controller (rev 0a)
02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
05:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller (rev 11)
16:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
16:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
16:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
16:00.4 Host bridge: Intel Corporation Device 0998
16:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
17:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
30:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
30:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
30:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
30:00.4 Host bridge: Intel Corporation Device 0998
30:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
30:03.0 PCI bridge: Intel Corporation Device 347b (rev 04)
30:04.0 PCI bridge: Intel Corporation Device 347c (rev 04)
31:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
33:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
33:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
33:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
33:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
4a:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
4a:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
4a:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
4a:00.4 Host bridge: Intel Corporation Device 0998
64:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
64:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
64:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
64:00.4 Host bridge: Intel Corporation Device 0998
64:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
65:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
7e:00.0 System peripheral: Intel Corporation Device 3450
7e:00.1 System peripheral: Intel Corporation Device 3451
7e:00.2 System peripheral: Intel Corporation Device 3452
7e:00.3 Host bridge: Intel Corporation Device 0998
7e:00.5 System peripheral: Intel Corporation Device 3455
7e:02.0 System peripheral: Intel Corporation Device 3440
7e:02.1 System peripheral: Intel Corporation Device 3441
7e:02.2 System peripheral: Intel Corporation Device 3442
7e:03.0 System peripheral: Intel Corporation Device 3440
7e:03.1 System peripheral: Intel Corporation Device 3441
7e:03.2 System peripheral: Intel Corporation Device 3442
7e:04.0 System peripheral: Intel Corporation Device 3440
7e:04.1 System peripheral: Intel Corporation Device 3441
7e:04.2 System peripheral: Intel Corporation Device 3442
7e:04.3 System peripheral: Intel Corporation Device 3443
7e:05.0 System peripheral: Intel Corporation Device 3445
7e:05.1 System peripheral: Intel Corporation Device 3446
7e:05.2 System peripheral: Intel Corporation Device 3447
7e:06.0 System peripheral: Intel Corporation Device 3445
7e:06.1 System peripheral: Intel Corporation Device 3446
7e:06.2 System peripheral: Intel Corporation Device 3447
7e:07.0 System peripheral: Intel Corporation Device 3445
7e:07.1 System peripheral: Intel Corporation Device 3446
7e:07.2 System peripheral: Intel Corporation Device 3447
7e:0b.0 System peripheral: Intel Corporation Device 3448
7e:0b.1 System peripheral: Intel Corporation Device 3448
7e:0b.2 System peripheral: Intel Corporation Device 344b
7e:0c.0 Performance counters: Intel Corporation Device 344a
7e:0d.0 Performance counters: Intel Corporation Device 344a
7e:0e.0 Performance counters: Intel Corporation Device 344a
7e:0f.0 Performance counters: Intel Corporation Device 344a
7e:1a.0 Performance counters: Intel Corporation Device 2880
7e:1b.0 Performance counters: Intel Corporation Device 2880
7e:1c.0 Performance counters: Intel Corporation Device 2880
7e:1d.0 Performance counters: Intel Corporation Device 2880
7f:00.0 System peripheral: Intel Corporation Device 344c
7f:00.1 System peripheral: Intel Corporation Device 344c
7f:00.2 System peripheral: Intel Corporation Device 344c
7f:00.3 System peripheral: Intel Corporation Device 344c
7f:00.4 System peripheral: Intel Corporation Device 344c
7f:00.5 System peripheral: Intel Corporation Device 344c
7f:00.6 System peripheral: Intel Corporation Device 344c
7f:00.7 System peripheral: Intel Corporation Device 344c
7f:01.0 System peripheral: Intel Corporation Device 344c
7f:01.1 System peripheral: Intel Corporation Device 344c
7f:01.2 System peripheral: Intel Corporation Device 344c
7f:01.3 System peripheral: Intel Corporation Device 344c
7f:01.4 System peripheral: Intel Corporation Device 344c
7f:01.5 System peripheral: Intel Corporation Device 344c
7f:01.6 System peripheral: Intel Corporation Device 344c
7f:01.7 System peripheral: Intel Corporation Device 344c
7f:02.0 System peripheral: Intel Corporation Device 344c
7f:02.1 System peripheral: Intel Corporation Device 344c
7f:02.2 System peripheral: Intel Corporation Device 344c
7f:02.3 System peripheral: Intel Corporation Device 344c
7f:02.4 System peripheral: Intel Corporation Device 344c
7f:02.5 System peripheral: Intel Corporation Device 344c
7f:02.6 System peripheral: Intel Corporation Device 344c
7f:02.7 System peripheral: Intel Corporation Device 344c
7f:03.0 System peripheral: Intel Corporation Device 344c
7f:03.1 System peripheral: Intel Corporation Device 344c
7f:03.2 System peripheral: Intel Corporation Device 344c
7f:03.3 System peripheral: Intel Corporation Device 344c
7f:03.4 System peripheral: Intel Corporation Device 344c
7f:03.5 System peripheral: Intel Corporation Device 344c
7f:03.6 System peripheral: Intel Corporation Device 344c
7f:03.7 System peripheral: Intel Corporation Device 344c
7f:04.0 System peripheral: Intel Corporation Device 344c
7f:04.1 System peripheral: Intel Corporation Device 344c
7f:04.2 System peripheral: Intel Corporation Device 344c
7f:04.3 System peripheral: Intel Corporation Device 344c
7f:04.4 System peripheral: Intel Corporation Device 344c
7f:04.5 System peripheral: Intel Corporation Device 344c
7f:04.6 System peripheral: Intel Corporation Device 344c
7f:04.7 System peripheral: Intel Corporation Device 344c
7f:0a.0 System peripheral: Intel Corporation Device 344d
7f:0a.1 System peripheral: Intel Corporation Device 344d
7f:0a.2 System peripheral: Intel Corporation Device 344d
7f:0a.3 System peripheral: Intel Corporation Device 344d
7f:0a.4 System peripheral: Intel Corporation Device 344d
7f:0a.5 System peripheral: Intel Corporation Device 344d
7f:0a.6 System peripheral: Intel Corporation Device 344d
7f:0a.7 System peripheral: Intel Corporation Device 344d
7f:0b.0 System peripheral: Intel Corporation Device 344d
7f:0b.1 System peripheral: Intel Corporation Device 344d
7f:0b.2 System peripheral: Intel Corporation Device 344d
7f:0b.3 System peripheral: Intel Corporation Device 344d
7f:0b.4 System peripheral: Intel Corporation Device 344d
7f:0b.5 System peripheral: Intel Corporation Device 344d
7f:0b.6 System peripheral: Intel Corporation Device 344d
7f:0b.7 System peripheral: Intel Corporation Device 344d
7f:0c.0 System peripheral: Intel Corporation Device 344d
7f:0c.1 System peripheral: Intel Corporation Device 344d
7f:0c.2 System peripheral: Intel Corporation Device 344d
7f:0c.3 System peripheral: Intel Corporation Device 344d
7f:0c.4 System peripheral: Intel Corporation Device 344d
7f:0c.5 System peripheral: Intel Corporation Device 344d
7f:0c.6 System peripheral: Intel Corporation Device 344d
7f:0c.7 System peripheral: Intel Corporation Device 344d
7f:0d.0 System peripheral: Intel Corporation Device 344d
7f:0d.1 System peripheral: Intel Corporation Device 344d
7f:0d.2 System peripheral: Intel Corporation Device 344d
7f:0d.3 System peripheral: Intel Corporation Device 344d
7f:0d.4 System peripheral: Intel Corporation Device 344d
7f:0d.5 System peripheral: Intel Corporation Device 344d
7f:0d.6 System peripheral: Intel Corporation Device 344d
7f:0d.7 System peripheral: Intel Corporation Device 344d
7f:0e.0 System peripheral: Intel Corporation Device 344d
7f:0e.1 System peripheral: Intel Corporation Device 344d
7f:0e.2 System peripheral: Intel Corporation Device 344d
7f:0e.3 System peripheral: Intel Corporation Device 344d
7f:0e.4 System peripheral: Intel Corporation Device 344d
7f:0e.5 System peripheral: Intel Corporation Device 344d
7f:0e.6 System peripheral: Intel Corporation Device 344d
7f:0e.7 System peripheral: Intel Corporation Device 344d
7f:1d.0 System peripheral: Intel Corporation Device 344f
7f:1d.1 System peripheral: Intel Corporation Device 3457
7f:1e.0 System peripheral: Intel Corporation Device 3458 (rev 06)
7f:1e.1 System peripheral: Intel Corporation Device 3459 (rev 06)
7f:1e.2 System peripheral: Intel Corporation Device 345a (rev 06)
7f:1e.3 System peripheral: Intel Corporation Device 345b (rev 06)
7f:1e.4 System peripheral: Intel Corporation Device 345c (rev 06)
7f:1e.5 System peripheral: Intel Corporation Device 345d (rev 06)
7f:1e.6 System peripheral: Intel Corporation Device 345e (rev 06)
7f:1e.7 System peripheral: Intel Corporation Device 345f (rev 06)
80:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
80:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
80:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
80:00.4 Host bridge: Intel Corporation Device 0998
80:02.0 System peripheral: Intel Corporation Device 09a6
80:02.1 System peripheral: Intel Corporation Device 09a7
80:02.4 Non-Essential Instrumentation [1300]: Intel Corporation Device 3456 (rev 01)
97:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
97:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
97:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
97:00.4 Host bridge: Intel Corporation Device 0998
97:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
98:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
98:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
b0:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
b0:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
b0:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
b0:00.4 Host bridge: Intel Corporation Device 0998
c9:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
c9:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
c9:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
c9:00.4 Host bridge: Intel Corporation Device 0998
c9:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
ca:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
e2:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
e2:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
e2:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
e2:00.4 Host bridge: Intel Corporation Device 0998
e2:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
e3:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
fe:00.0 System peripheral: Intel Corporation Device 3450
fe:00.1 System peripheral: Intel Corporation Device 3451
fe:00.2 System peripheral: Intel Corporation Device 3452
fe:00.3 Host bridge: Intel Corporation Device 0998
fe:00.5 System peripheral: Intel Corporation Device 3455
fe:02.0 System peripheral: Intel Corporation Device 3440
fe:02.1 System peripheral: Intel Corporation Device 3441
fe:02.2 System peripheral: Intel Corporation Device 3442
fe:03.0 System peripheral: Intel Corporation Device 3440
fe:03.1 System peripheral: Intel Corporation Device 3441
fe:03.2 System peripheral: Intel Corporation Device 3442
fe:04.0 System peripheral: Intel Corporation Device 3440
fe:04.1 System peripheral: Intel Corporation Device 3441
fe:04.2 System peripheral: Intel Corporation Device 3442
fe:04.3 System peripheral: Intel Corporation Device 3443
fe:05.0 System peripheral: Intel Corporation Device 3445
fe:05.1 System peripheral: Intel Corporation Device 3446
fe:05.2 System peripheral: Intel Corporation Device 3447
fe:06.0 System peripheral: Intel Corporation Device 3445
fe:06.1 System peripheral: Intel Corporation Device 3446
fe:06.2 System peripheral: Intel Corporation Device 3447
fe:07.0 System peripheral: Intel Corporation Device 3445
fe:07.1 System peripheral: Intel Corporation Device 3446
fe:07.2 System peripheral: Intel Corporation Device 3447
fe:0b.0 System peripheral: Intel Corporation Device 3448
fe:0b.1 System peripheral: Intel Corporation Device 3448
fe:0b.2 System peripheral: Intel Corporation Device 344b
fe:0c.0 Performance counters: Intel Corporation Device 344a
fe:0d.0 Performance counters: Intel Corporation Device 344a
fe:0e.0 Performance counters: Intel Corporation Device 344a
fe:0f.0 Performance counters: Intel Corporation Device 344a
fe:1a.0 Performance counters: Intel Corporation Device 2880
fe:1b.0 Performance counters: Intel Corporation Device 2880
fe:1c.0 Performance counters: Intel Corporation Device 2880
fe:1d.0 Performance counters: Intel Corporation Device 2880
ff:00.0 System peripheral: Intel Corporation Device 344c
ff:00.1 System peripheral: Intel Corporation Device 344c
ff:00.2 System peripheral: Intel Corporation Device 344c
ff:00.3 System peripheral: Intel Corporation Device 344c
ff:00.4 System peripheral: Intel Corporation Device 344c
ff:00.5 System peripheral: Intel Corporation Device 344c
ff:00.6 System peripheral: Intel Corporation Device 344c
ff:00.7 System peripheral: Intel Corporation Device 344c
ff:01.0 System peripheral: Intel Corporation Device 344c
ff:01.1 System peripheral: Intel Corporation Device 344c
ff:01.2 System peripheral: Intel Corporation Device 344c
ff:01.3 System peripheral: Intel Corporation Device 344c
ff:01.4 System peripheral: Intel Corporation Device 344c
ff:01.5 System peripheral: Intel Corporation Device 344c
ff:01.6 System peripheral: Intel Corporation Device 344c
ff:01.7 System peripheral: Intel Corporation Device 344c
ff:02.0 System peripheral: Intel Corporation Device 344c
ff:02.1 System peripheral: Intel Corporation Device 344c
ff:02.2 System peripheral: Intel Corporation Device 344c
ff:02.3 System peripheral: Intel Corporation Device 344c
ff:02.4 System peripheral: Intel Corporation Device 344c
ff:02.5 System peripheral: Intel Corporation Device 344c
ff:02.6 System peripheral: Intel Corporation Device 344c
ff:02.7 System peripheral: Intel Corporation Device 344c
ff:03.0 System peripheral: Intel Corporation Device 344c
ff:03.1 System peripheral: Intel Corporation Device 344c
ff:03.2 System peripheral: Intel Corporation Device 344c
ff:03.3 System peripheral: Intel Corporation Device 344c
ff:03.4 System peripheral: Intel Corporation Device 344c
ff:03.5 System peripheral: Intel Corporation Device 344c
ff:03.6 System peripheral: Intel Corporation Device 344c
ff:03.7 System peripheral: Intel Corporation Device 344c
ff:04.0 System peripheral: Intel Corporation Device 344c
ff:04.1 System peripheral: Intel Corporation Device 344c
ff:04.2 System peripheral: Intel Corporation Device 344c
ff:04.3 System peripheral: Intel Corporation Device 344c
ff:04.4 System peripheral: Intel Corporation Device 344c
ff:04.5 System peripheral: Intel Corporation Device 344c
ff:04.6 System peripheral: Intel Corporation Device 344c
ff:04.7 System peripheral: Intel Corporation Device 344c
ff:0a.0 System peripheral: Intel Corporation Device 344d
ff:0a.1 System peripheral: Intel Corporation Device 344d
ff:0a.2 System peripheral: Intel Corporation Device 344d
ff:0a.3 System peripheral: Intel Corporation Device 344d
ff:0a.4 System peripheral: Intel Corporation Device 344d
ff:0a.5 System peripheral: Intel Corporation Device 344d
ff:0a.6 System peripheral: Intel Corporation Device 344d
ff:0a.7 System peripheral: Intel Corporation Device 344d
ff:0b.0 System peripheral: Intel Corporation Device 344d
ff:0b.1 System peripheral: Intel Corporation Device 344d
ff:0b.2 System peripheral: Intel Corporation Device 344d
ff:0b.3 System peripheral: Intel Corporation Device 344d
ff:0b.4 System peripheral: Intel Corporation Device 344d
ff:0b.5 System peripheral: Intel Corporation Device 344d
ff:0b.6 System peripheral: Intel Corporation Device 344d
ff:0b.7 System peripheral: Intel Corporation Device 344d
ff:0c.0 System peripheral: Intel Corporation Device 344d
ff:0c.1 System peripheral: Intel Corporation Device 344d
ff:0c.2 System peripheral: Intel Corporation Device 344d
ff:0c.3 System peripheral: Intel Corporation Device 344d
ff:0c.4 System peripheral: Intel Corporation Device 344d
ff:0c.5 System peripheral: Intel Corporation Device 344d
ff:0c.6 System peripheral: Intel Corporation Device 344d
ff:0c.7 System peripheral: Intel Corporation Device 344d
ff:0d.0 System peripheral: Intel Corporation Device 344d
ff:0d.1 System peripheral: Intel Corporation Device 344d
ff:0d.2 System peripheral: Intel Corporation Device 344d
ff:0d.3 System peripheral: Intel Corporation Device 344d
ff:0d.4 System peripheral: Intel Corporation Device 344d
ff:0d.5 System peripheral: Intel Corporation Device 344d
ff:0d.6 System peripheral: Intel Corporation Device 344d
ff:0d.7 System peripheral: Intel Corporation Device 344d
ff:0e.0 System peripheral: Intel Corporation Device 344d
ff:0e.1 System peripheral: Intel Corporation Device 344d
ff:0e.2 System peripheral: Intel Corporation Device 344d
ff:0e.3 System peripheral: Intel Corporation Device 344d
ff:0e.4 System peripheral: Intel Corporation Device 344d
ff:0e.5 System peripheral: Intel Corporation Device 344d
ff:0e.6 System peripheral: Intel Corporation Device 344d
ff:0e.7 System peripheral: Intel Corporation Device 344d
ff:1d.0 System peripheral: Intel Corporation Device 344f
ff:1d.1 System peripheral: Intel Corporation Device 3457
ff:1e.0 System peripheral: Intel Corporation Device 3458 (rev 06)
ff:1e.1 System peripheral: Intel Corporation Device 3459 (rev 06)
ff:1e.2 System peripheral: Intel Corporation Device 345a (rev 06)
ff:1e.3 System peripheral: Intel Corporation Device 345b (rev 06)
ff:1e.4 System peripheral: Intel Corporation Device 345c (rev 06)
ff:1e.5 System peripheral: Intel Corporation Device 345d (rev 06)
ff:1e.6 System peripheral: Intel Corporation Device 345e (rev 06)
ff:1e.7 System peripheral: Intel Corporation Device 345f (rev 06)
Artemy-Mellanox commented 1 year ago

@karanveersingh5623 the setup has some strange problems. Could you please post the output of the following commands, run on the host rather than inside the Docker container?


uname -a
lsmod
modinfo ib_uverbs
modinfo ib_umad
modinfo mlx5_ib
ibstat
ibv_devinfo
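
One way to gather all of these into a single file to attach here, as a minimal sketch (the function name `collect_host_info` and the default output filename are hypothetical):

```bash
#!/bin/sh
# Run each requested command on the host and write its output, under a
# header line, to one report file. Missing tools are noted, not fatal.
collect_host_info() {
  out=${1:-host_rdma_info.txt}
  : > "$out"   # create or truncate the report
  for cmd in "uname -a" "lsmod" "modinfo ib_uverbs" "modinfo ib_umad" \
             "modinfo mlx5_ib" "ibstat" "ibv_devinfo"; do
    echo "### $cmd" >> "$out"
    tool=${cmd%% *}   # first word of the command line
    if command -v "$tool" >/dev/null 2>&1; then
      # $cmd is left unquoted on purpose so its arguments are split
      $cmd >> "$out" 2>&1 || echo "(exit code $?)" >> "$out"
    else
      echo "(command not found: $tool)" >> "$out"
    fi
  done
}

# Usage: collect_host_info /tmp/node001_rdma_info.txt
```

A failing command is recorded with its exit code and the run continues, so one missing tool does not hide the rest of the output.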
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

Please find the details from my compute node, node001, below:

[root@node001 ~]# uname -a
Linux node001 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 14:48:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# lsmod
Module                  Size  Used by
mgc                   102400  1
lustre               1040384  7036
lmv                   204800  2 lustre
mdc                   278528  2 lustre
fid                    36864  1 mdc
lov                   344064  4693 mdc,lustre
fld                    45056  2 lov,lmv
osc                   454656  4692 mdc
ksocklnd              184320  1
ptlrpc               1425408  8 fld,osc,fid,mgc,lov,mdc,lmv,lustre
obdclass             3362816  4826 fld,osc,fid,ptlrpc,mgc,lov,mdc,lmv,lustre
lnet                  704512  7 osc,obdclass,ptlrpc,mgc,ksocklnd,lmv,lustre
libcfs                266240  12 fld,lnet,osc,fid,obdclass,ptlrpc,mgc,ksocklnd,lov,mdc,lmv,lustre
xt_conntrack           16384  1
ipt_MASQUERADE         16384  1
nf_conntrack_netlink    49152  0
nft_counter            16384  15
xt_addrtype            16384  2
nft_compat             20480  4
nft_chain_nat          16384  4
nf_nat                 45056  2 ipt_MASQUERADE,nft_chain_nat
nf_conntrack          172032  4 xt_conntrack,nf_nat,ipt_MASQUERADE,nf_conntrack_netlink
nf_defrag_ipv6         20480  1 nf_conntrack
nf_defrag_ipv4         16384  1 nf_conntrack
nf_tables             180224  43 nft_compat,nft_counter,nft_chain_nat
nfnetlink              16384  3 nft_compat,nf_conntrack_netlink,nf_tables
overlay               139264  0
dell_rbu               16384  0
nvidia_drm             69632  0
nvidia_modeset       1142784  1 nvidia_drm
nvidia_uvm           1298432  0
nvidia              40792064  163 nvidia_uvm,nvidia_modeset
intel_rapl_msr         16384  0
intel_rapl_common      24576  1 intel_rapl_msr
ipmi_ssif              36864  0
i10nm_edac             24576  0
nfit                   65536  1 i10nm_edac
libnvdimm             196608  1 nfit
x86_pkg_temp_thermal    16384  0
intel_powerclamp       16384  0
coretemp               16384  0
kvm_intel             339968  0
iTCO_wdt               16384  0
kvm                   905216  1 kvm_intel
irqbypass              16384  1 kvm
dell_smbios            24576  0
crc32_pclmul           16384  0
iTCO_vendor_support    16384  1 iTCO_wdt
dell_wmi_descriptor    16384  1 dell_smbios
wmi_bmof               16384  0
rapl                   20480  0
mgag200                36864  0
dcdbas                 16384  0
intel_cstate           20480  0
rpcrdma               282624  0
drm_kms_helper        266240  4 mgag200,nvidia_drm
intel_uncore          204800  0
pcspkr                 16384  0
syscopyarea            16384  1 drm_kms_helper
sysfillrect            16384  1 drm_kms_helper
joydev                 24576  0
sysimgblt              16384  1 drm_kms_helper
fb_sys_fops            16384  1 drm_kms_helper
isst_if_mbox_pci       16384  0
drm                   585728  5 drm_kms_helper,nvidia,mgag200,nvidia_drm
isst_if_mmio           16384  0
isst_if_common         16384  2 isst_if_mmio,isst_if_mbox_pci
mei_me                 45056  0
i2c_i801               28672  0
mei                   118784  1 mei_me
acpi_ipmi              16384  0
intel_pmt              16384  0
wmi                    32768  3 wmi_bmof,dell_smbios,dell_wmi_descriptor
ipmi_si                69632  1
acpi_power_meter       20480  0
binfmt_misc            20480  1
ipmi_devintf           20480  0
ipmi_msghandler       110592  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
lpfc                 1179648  0
nvmet_fc               40960  1 lpfc
nvmet                 110592  1 nvmet_fc
nvme_fc                53248  1 lpfc
nvme_fabrics           24576  1 nvme_fc
iavf                  151552  0
ixgbevf                77824  0
mlx4_en               135168  0
mlx4_core             364544  1 mlx4_en
qedr                  126976  0
qede                  184320  1 qedr
qed                   778240  2 qede,qedr
crc8                   16384  1 qed
hpilo                  20480  0
sr_mod                 28672  0
xts                    16384  0
dm_crypt               49152  0
bnxt_en               286720  0
mpt3sas               335872  0
raid_class             16384  1 mpt3sas
usb_storage            73728  0
squashfs               65536  0
loop                   40960  0
isofs                  49152  0
smartpqi               98304  0
dm_thin_pool           86016  0
dm_bio_prison          20480  1 dm_thin_pool
dm_persistent_data     94208  1 dm_thin_pool
dm_bufio               32768  1 dm_persistent_data
dm_mod                151552  3 dm_crypt,dm_thin_pool,dm_bufio
udf                   102400  0
crc_itu_t              16384  1 udf
cdrom                  65536  3 udf,isofs,sr_mod
scsi_transport_fc      81920  1 lpfc
vfat                   20480  1
fat                    81920  1 vfat
br_netfilter           24576  0
bridge                278528  1 br_netfilter
stp                    16384  1 bridge
llc                    16384  2 bridge,stp
xfs                  1556480  1
qla3xxx                49152  0
hpsa                  102400  0
e1000e                286720  0
ixgbe                 376832  0
igb                   253952  0
i2c_algo_bit           16384  2 igb,mgag200
dca                    16384  2 igb,ixgbe
megaraid_sas          176128  0
aacraid               139264  0
ata_piix               36864  0
sd_mod                 53248  0
mptspi                 28672  0
scsi_transport_spi     40960  1 mptspi
mptsas                 69632  0
mptscsih               45056  2 mptsas,mptspi
mptbase                98304  3 mptsas,mptspi,mptscsih
scsi_transport_sas     45056  4 mptsas,hpsa,smartpqi,mpt3sas
bnx2x                 876544  0
mdio                   16384  2 bnx2x,ixgbe
libcrc32c              16384  6 nf_conntrack,nf_nat,dm_persistent_data,bnx2x,nf_tables,xfs
bnx2                   94208  0
ext4                  761856  0
mbcache                16384  1 ext4
jbd2                  131072  1 ext4
e1000                 151552  0
nfsv4                 835584  0
dns_resolver           16384  1 nfsv4
nfsv3                  53248  1
nfs_acl                16384  1 nfsv3
nfs                   385024  4 nfsv4,nfsv3
lockd                 122880  2 nfsv3,nfs
grace                  16384  1 lockd
sunrpc                565248  23 lnet,rpcrdma,nfsv4,lockd,nfsv3,nfs_acl,nfs
fscache               385024  1 nfs
tun                    49152  0
irdma                 356352  0
ice                   765952  1 irdma
rdma_ucm               32768  0
ib_srpt                69632  0
ib_isert               57344  0
iscsi_target_mod      356352  1 ib_isert
target_core_mod       417792  3 iscsi_target_mod,ib_srpt,ib_isert
ib_iser                49152  0
libiscsi               61440  1 ib_iser
scsi_transport_iscsi   131072  2 ib_iser,libiscsi
ib_umad                28672  0
rdma_cm               114688  5 rpcrdma,ib_srpt,ib_iser,ib_isert,rdma_ucm
ib_ipoib              147456  0
iw_cm                  53248  1 rdma_cm
ib_cm                 114688  3 rdma_cm,ib_ipoib,ib_srpt
mlx5_ib               389120  0
ib_uverbs             163840  4 irdma,rdma_ucm,mlx5_ib,qedr
ib_core               393216  14 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,irdma,rdma_ucm,ib_uverbs,mlx5_ib,qedr,ib_cm
sg                     40960  0
mlx5_core            1572864  1 mlx5_ib
crct10dif_pclmul       16384  1
crc32c_intel           24576  1
pci_hyperv_intf        16384  1 mlx5_core
i40e                  491520  1 irdma
ahci                   40960  0
psample                20480  1 mlx5_core
ghash_clmulni_intel    16384  0
nvme                   45056  3
libahci                40960  1 ahci
mlxfw                  28672  1 mlx5_core
tls                   102400  1 mlx5_core
libata                262144  3 ata_piix,libahci,ahci
nvme_core             114688  7 nvme,nvme_fc,nvme_fabrics
tg3                   188416  0
t10_pi                 16384  3 nvmet,sd_mod,nvme_core
fuse                  155648  1
[root@node001 ~]# modinfo ib_uverbs
filename:       /lib/modules/4.18.0-372.9.1.el8.x86_64/kernel/drivers/infiniband/core/ib_uverbs.ko.xz
alias:          rdma-client-uverbs
license:        Dual BSD/GPL
description:    InfiniBand userspace verbs access
author:         Roland Dreier
rhelversion:    8.6
srcversion:     F485E52CF6F50429494777A
depends:        ib_core
intree:         Y
name:           ib_uverbs
vermagic:       4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Rocky kernel signing key
sig_key:        24:62:83:5E:57:6D:46:8C:7B:45:DD:87:7C:69:5A:C6:BC:46:85:94
sig_hashalgo:   sha256
signature:      3B:5D:F8:D7:4E:50:C2:51:0E:AB:BD:C8:26:B9:7E:DB:F8:41:15:F3:
                83:06:82:74:BE:CC:D7:55:CC:C9:52:93:67:F8:6E:7D:44:09:FC:45:
                4F:8E:30:49:42:A1:6B:6D:B8:8C:D5:D9:B0:E8:2B:9B:B8:F2:AB:BA:
                61:72:A9:56:1C:B5:2C:CB:86:31:64:7E:3D:4F:ED:78:49:CA:5D:FD:
                5F:AB:0C:E2:5B:45:A0:40:7A:E8:5B:7C:6A:EE:F3:18:CC:E5:38:58:
                94:C7:90:B1:66:64:63:25:57:0C:85:B8:F6:FD:60:B0:70:90:67:3A:
                9F:8F:62:7F:A6:A8:E1:50:57:4A:5C:43:E8:9C:6C:B1:91:46:4F:64:
                61:91:BE:C9:DD:48:07:62:70:A1:90:81:00:DD:50:11:CC:D8:F5:F4:
                B5:86:79:82:FD:78:49:65:77:05:85:4F:A5:F3:F5:D6:54:E8:CD:A7:
                DA:F5:6E:0F:32:F1:B3:BE:09:52:1B:33:18:BC:0A:56:1D:73:10:66:
                E7:6A:6F:A7:A6:08:28:64:D4:3E:EB:66:64:C0:C1:3D:E7:16:1A:38:
                A3:D5:3B:4E:0F:05:83:A1:1E:95:44:20:D9:19:C7:5D:9C:CA:E8:3E:
                F5:C9:6F:5E:88:6B:50:0B:8B:B0:EF:6C:E2:5F:61:39:91:32:E0:C6:
                67:92:C9:9F:8F:5E:2D:E9:9C:D7:07:7B:7E:AF:AC:3F:FB:72:B3:2D:
                37:93:FC:24:0C:55:4F:28:53:D5:5D:66:AB:2F:E1:CF:A7:EE:6C:C3:
                71:4A:9D:1B:85:4F:62:DA:12:FF:D1:87:F5:4C:48:2B:F1:5D:9F:24:
                50:A4:BA:1D:6B:99:77:61:9B:65:39:9F:56:51:5F:65:C4:4F:3E:5D:
                A0:91:93:E0:5E:7C:73:95:D8:C8:B1:E2:D9:BE:F5:0E:D5:82:64:D8:
                01:C6:49:0D:1F:C0:CD:DC:5B:99:43:86:95:05:B6:8A:30:44:57:4E:
                C9:75:FC:09
[root@node001 ~]# modinfo ib_umad
filename:       /lib/modules/4.18.0-372.9.1.el8.x86_64/kernel/drivers/infiniband/core/ib_umad.ko.xz
alias:          rdma-client-issm
alias:          rdma-client-umad
license:        Dual BSD/GPL
description:    InfiniBand userspace MAD packet access
author:         Roland Dreier
rhelversion:    8.6
srcversion:     EEA36F7782E21E939DF90E0
depends:        ib_core
intree:         Y
name:           ib_umad
vermagic:       4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Rocky kernel signing key
sig_key:        24:62:83:5E:57:6D:46:8C:7B:45:DD:87:7C:69:5A:C6:BC:46:85:94
sig_hashalgo:   sha256
signature:      1D:FC:C2:92:9D:C7:32:66:5A:09:CD:64:64:96:A5:12:4A:4B:84:F6:
                4C:0E:12:B0:61:F4:55:49:D3:05:79:02:90:F3:AF:40:0D:4A:96:62:
                30:7B:D5:42:C9:9F:6C:CD:9C:EF:D9:D5:B9:B4:FC:73:C3:3E:25:9C:
                07:0E:C8:90:CA:72:08:A7:67:93:1F:EB:ED:89:B9:AA:16:17:91:CE:
                1E:18:D6:80:C1:CA:03:8F:04:C8:03:AC:49:B0:D6:4E:EA:F4:2D:6E:
                9E:9D:83:F2:33:EF:6B:AF:D3:EA:6E:8B:47:9C:5A:29:11:B9:3F:CF:
                16:88:55:6F:38:0E:95:01:38:75:EE:81:15:2E:8F:F5:A1:F2:1D:33:
                04:49:0A:E9:DE:3C:D5:27:17:AE:12:96:0A:DE:9E:DB:CD:3B:0D:E6:
                22:9F:26:CB:44:C2:56:9D:06:27:9E:F4:A5:AC:D9:8D:A8:B4:3B:94:
                23:74:02:F2:55:75:B8:65:AD:8A:F7:B7:8B:9C:BD:7E:B0:D6:CF:C9:
                33:08:F2:5A:91:DC:36:57:72:21:1D:E0:E1:EF:F5:C4:4B:FA:C3:4C:
                95:D3:8C:8D:50:3F:CC:B8:0F:0A:84:7E:F3:C2:8E:9D:EE:F9:D8:B5:
                19:9B:65:42:D0:37:77:10:3B:CF:D6:92:85:BD:D0:55:A1:2C:6D:2F:
                FD:8E:17:87:C4:4B:E2:D7:12:9C:73:B0:A1:63:9B:FE:2E:2D:FC:94:
                E8:E2:0C:CA:F2:1D:EC:27:E1:D5:9B:00:F1:08:53:8B:A3:92:F1:10:
                30:D2:91:F6:5F:F0:B6:C2:2A:82:86:D9:ED:20:BB:9B:BF:EF:4C:4A:
                A2:9B:DB:CF:E9:64:5D:7D:E8:0D:A6:22:25:B3:1A:F8:F5:63:E0:D4:
                7B:96:9E:AF:24:38:54:56:35:53:C3:AC:49:C0:CD:D5:33:8A:56:7D:
                D7:C0:46:6F:9A:97:A3:F2:7E:14:3C:9A:6A:6D:36:EF:D5:F2:4A:10:
                E2:02:52:AC
[root@node001 ~]# modinfo mlx5_ib
filename:       /lib/modules/4.18.0-372.9.1.el8.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
license:        Dual BSD/GPL
description:    Mellanox 5th generation network adapters (ConnectX series) IB driver
author:         Eli Cohen <eli@mellanox.com>
rhelversion:    8.6
srcversion:     D733C181AA9D6B40A8CBDD4
alias:          auxiliary:mlx5_core.rdma
alias:          auxiliary:mlx5_core.multiport
alias:          auxiliary:mlx5_core.rdma-rep
depends:        mlx5_core,ib_core,ib_uverbs
intree:         Y
name:           mlx5_ib
vermagic:       4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
sig_id:         PKCS#7
signer:         Rocky kernel signing key
sig_key:        24:62:83:5E:57:6D:46:8C:7B:45:DD:87:7C:69:5A:C6:BC:46:85:94
sig_hashalgo:   sha256
signature:      4A:A4:31:5C:5E:15:11:F8:29:44:2D:BA:41:1B:1E:5E:0D:B2:E4:2A:
                72:9C:7C:F5:A2:5E:09:41:85:CF:4E:91:6D:1D:21:7F:3B:1D:B6:F7:
                B0:F4:F3:CA:9D:51:9C:60:96:47:11:F3:DB:52:0E:C0:AF:21:40:5F:
                3D:C9:48:29:2B:3A:FE:84:A6:92:4B:52:57:AA:A0:4C:D7:FE:29:D1:
                74:6B:F8:67:0F:6F:52:3C:DD:0F:69:7B:D0:F5:13:14:22:F8:23:F2:
                A1:78:CE:A3:4F:88:FC:8D:D6:A4:0D:A8:6B:82:13:AC:E7:3E:E3:B6:
                A2:4E:B7:64:97:CA:03:32:AB:FF:0E:4D:08:2B:4C:F1:88:93:6F:97:
                D2:D5:74:79:77:77:E1:15:71:06:AC:7C:AB:97:23:04:16:E4:59:A5:
                14:01:2D:CF:F1:EF:3D:29:9B:9A:FB:43:01:BE:9F:34:89:2E:92:30:
                87:6C:0F:04:9E:88:A2:EC:D1:E5:76:9A:A0:12:62:B3:86:30:CF:0A:
                99:57:6C:98:29:F0:43:47:87:47:F3:0F:E7:F5:15:A0:D0:3D:98:83:
                36:71:32:D0:BF:60:A4:B0:3D:1A:24:AF:9C:CC:12:67:10:6A:47:62:
                08:A8:A5:72:1F:AF:46:D9:56:F0:D0:2C:D6:C5:C5:D7:CF:44:54:F2:
                A5:49:F4:E8:76:B4:F1:82:0B:C8:7C:99:38:4C:86:DB:60:F6:7E:0B:
                D8:4D:19:B8:D1:BE:20:F2:22:5F:B8:DE:7E:FE:18:D9:A0:35:E3:B6:
                18:33:7E:C6:DC:C8:3A:5E:16:7F:14:61:FA:65:FE:E3:51:98:01:B1:
                99:49:81:69:A0:4B:32:64:9F:6B:F8:5F:4A:8A:50:E9:15:D7:A6:FB:
                D3:06:3D:EE:94:69:2A:9A:D4:9A:61:67:F9:8D:42:2D:44:A1:EE:B8:
                D6:1B:75:FE:EF:85:35:7B:00:A9:F9:04:81:68:99:6A:FD:71:51:27:
                06:ED:8A:2B
[root@node001 ~]# ibstat
CA 'mlx5_0'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.1014
        Hardware version: 0
        Node GUID: 0x043f720300dc0684
        System image GUID: 0x043f720300dc0684
        Port 1:
                State: Active
                Physical state: LinkUp
                Rate: 100
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x063f72fffedc0684
                Link layer: Ethernet
CA 'mlx5_1'
        CA type: MT4119
        Number of ports: 1
        Firmware version: 16.31.1014
        Hardware version: 0
        Node GUID: 0x043f720300dc0685
        System image GUID: 0x043f720300dc0684
        Port 1:
                State: Down
                Physical state: Disabled
                Rate: 40
                Base lid: 0
                LMC: 0
                SM lid: 0
                Capability mask: 0x00010000
                Port GUID: 0x063f72fffedc0685
                Link layer: Ethernet
[root@node001 ~]# ibv_devinfo
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.31.1014
        node_guid:                      043f:7203:00dc:0684
        sys_image_guid:                 043f:7203:00dc:0684
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000012
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         16.31.1014
        node_guid:                      043f:7203:00dc:0685
        sys_image_guid:                 043f:7203:00dc:0684
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000012
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
Artemy-Mellanox commented 1 year ago

@karanveersingh5623 could you please create a file called 50-mellanox.env containing the line MELLANOX_VISIBLE_DEVICES=all, put it in the /etc/enroot/environ.d/ directory on node002 and every other compute node you are using, and rerun:

srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
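
A minimal sketch of creating that file on a node, assuming root access. The function name `write_mellanox_env` and its optional directory argument are illustrative only; the file name, the variable, and the default directory are the ones given above.

```shell
#!/bin/bash
# Sketch: write the enroot environment file described above.
# write_mellanox_env is a hypothetical helper, not part of enroot.
write_mellanox_env() {
    # Default to the directory named in the comment; allow an override for testing.
    local dir="${1:-/etc/enroot/environ.d}"
    mkdir -p "$dir" || return 1
    printf '%s\n' 'MELLANOX_VISIBLE_DEVICES=all' > "$dir/50-mellanox.env"
}

# Usage (as root, repeated on every compute node):
#   write_mellanox_env
```

The directory override exists only so the helper can be tried in a scratch directory before touching /etc.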
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo

pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
hca_id: mlx5_0
        transport:                      InfiniBand (0)
        fw_ver:                         16.31.1014
        node_guid:                      043f:7203:00dc:0684
        sys_image_guid:                 043f:7203:00dc:0684
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000012
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet

hca_id: mlx5_1
        transport:                      InfiniBand (0)
        fw_ver:                         16.31.1014
        node_guid:                      043f:7203:00dc:0685
        sys_image_guid:                 043f:7203:00dc:0684
        vendor_id:                      0x02c9
        vendor_part_id:                 4119
        hw_ver:                         0x0
        board_id:                       MT_0000000012
        phys_port_cnt:                  1
                port:   1
                        state:                  PORT_DOWN (1)
                        max_mtu:                4096 (5)
                        active_mtu:             1024 (3)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             Ethernet
Artemy-Mellanox commented 1 year ago

That was probably the missing part. Could you please rerun the original run_and_time.sh scenario?

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox Thanks, it worked :) But it's taking a very long time: after 2.5 hrs it's still at epoch 1, and there are 5 epochs in total. When I run on a single node with multiple GPUs, the same task finishes in 15~20 min.

[root@bright88 mxnet]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               149      defq     bash     root  R    2:38:18      2 node[001-002]
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-22 02:31:51 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=0-31,64-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-22 02:31:57 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=22
+ exec numactl --physcpubind=0-21,44-65 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
[14:31:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
Namespace(apply_log_transform=True, base_lr=0.001, config_file=None, cuda_profiler_range='', dali_num_threads=64, dali_use_mmap=False, data_layout='NDHWC', data_root_dir=PosixPath('/data'), data_shard_multiplier=1, dropout=0.5, grad_prediv_factor=1.0, initial_lr=0.001, instances=1, load_checkpoint='', log_prefix='run__{}_.log', lr_scheduler_decays=[0.25, 0.125], lr_scheduler_epochs=[32, 64], momentum=0.9, num_epochs=5, preshuffle=True, prestage=False, profile=False, save_checkpoint='/results/checkpoint.data', seed=0, shard_type='local', shuffle=True, spatial_span=1, static_loss_scale=16384, target_mae=0.124, training_batch_size=16, training_samples=-1, use_amp=False, use_fp16=False, validation_batch_size=16, validation_samples=-1, warmup_epochs=1, weight_decay=0.0)
:::MLLOG {"namespace": "", "time_ms": 1669095119147, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": null, "metadata": {"file": "train.py", "lineno": 134}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "train.py", "lineno": 135}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "cosmoflow", "metadata": {"file": "train.py", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "NVIDIA", "metadata": {"file": "train.py", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "train.py", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "train.py", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "2xNVIDIA DGX A100", "metadata": {"file": "train.py", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1669095119177, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 2, "metadata": {"file": "train.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1669095119177, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 1, "metadata": {"file": "train.py", "lineno": 146}}
[14:31:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
[14:32:00] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[14:32:00] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for dgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 32 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for dgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 32 num_group: 1 workspace: 1024
node001:3316259:3316286 [0] NCCL INFO Bootstrap : Using eth3:192.168.61.89<0>
node001:3316259:3316286 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
node001:3316259:3316286 [0] NCCL INFO P2P plugin IBext
node001:3316259:3316286 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth3:192.168.61.89<0>
node001:3316259:3316286 [0] NCCL INFO Using network IBext
NCCL version 2.11.4+cuda11.4
node002:2000130:2000157 [0] NCCL INFO Bootstrap : Using eth4:192.168.61.90<0>
node002:2000130:2000157 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
node002:2000130:2000157 [0] NCCL INFO P2P plugin IBext
node002:2000130:2000157 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth4:192.168.61.90<0>
node002:2000130:2000157 [0] NCCL INFO Using network IBext
node001:3316259:3316286 [0] NCCL INFO Channel 00/02 :    0   1
node001:3316259:3316286 [0] NCCL INFO Channel 01/02 :    0   1
node001:3316259:3316286 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
node001:3316259:3316286 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,00000000,55555555
node002:2000130:2000157 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
node002:2000130:2000157 [0] NCCL INFO Setting affinity for GPU 0 to 02,aaaaa000,002aaaaa
node001:3316259:3316286 [0] NCCL INFO Channel 00 : 1[af000] -> 0[17000] [receive] via NET/IBext/0
node001:3316259:3316286 [0] NCCL INFO Channel 01 : 1[af000] -> 0[17000] [receive] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 00 : 0[17000] -> 1[af000] [receive] via NET/IBext/0
node001:3316259:3316286 [0] NCCL INFO Channel 00 : 0[17000] -> 1[af000] [send] via NET/IBext/0
node001:3316259:3316286 [0] NCCL INFO Channel 01 : 0[17000] -> 1[af000] [send] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 01 : 0[17000] -> 1[af000] [receive] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 00 : 1[af000] -> 0[17000] [send] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 01 : 1[af000] -> 0[17000] [send] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Connected all rings
node002:2000130:2000157 [0] NCCL INFO Connected all trees
node002:2000130:2000157 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
node002:2000130:2000157 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
node002:2000130:2000157 [0] NCCL INFO comm 0x15540845f760 rank 1 nranks 2 cudaDev 0 busId af000 - Init COMPLETE
node001:3316259:3316286 [0] NCCL INFO Connected all rings
node001:3316259:3316286 [0] NCCL INFO Connected all trees
node001:3316259:3316286 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
node001:3316259:3316286 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
node001:3316259:3316286 [0] NCCL INFO comm 0x155408460740 rank 0 nranks 2 cudaDev 0 busId 17000 - Init COMPLETE
node001:3316259:3316286 [0] NCCL INFO Launch mode Parallel

:::MLLOG {"namespace": "", "time_ms": 1669097805467, "event_type": "POINT_IN_TIME", "key": "opt_weight_decay", "value": 0.0, "metadata": {"file": "train.py", "lineno": 165}}
:::MLLOG {"namespace": "", "time_ms": 1669097805467, "event_type": "POINT_IN_TIME", "key": "dropout", "value": 0.5, "metadata": {"file": "train.py", "lineno": 167}}
:::MLLOG {"namespace": "", "time_ms": 1669097805555, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 32, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 352}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 32768, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 354}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 16384, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 355}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.001, "metadata": {"file": "train.py", "lineno": 92}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_epochs", "value": 1, "metadata": {"file": "train.py", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 1, "metadata": {"file": "train.py", "lineno": 96}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_boundary_epochs", "value": [32, 64], "metadata": {"file": "train.py", "lineno": 98}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_factor", "value": [0.25, 0.125], "metadata": {"file": "train.py", "lineno": 100}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_name", "value": "sgd", "metadata": {"file": "train.py", "lineno": 184}}
:::MLLOG {"namespace": "", "time_ms": 1669097805566, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/cosmoflow/utils.py", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1669097805566, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "train.py", "lineno": 206}}
:::MLLOG {"namespace": "", "time_ms": 1669097805566, "event_type": "INTERVAL_START", "key": "staging_start", "value": null, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 359}}
:::MLLOG {"namespace": "", "time_ms": 1669097807786, "event_type": "INTERVAL_END", "key": "staging_stop", "value": null, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 362, "staging_duration": 2.219392776489258}}
:::MLLOG {"namespace": "", "time_ms": 1669097807786, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "train.py", "lineno": 215, "epoch_num": 1}}
Artemy-Mellanox commented 1 year ago

@karanveersingh5623 can we narrow down this issue by running some basic RDMA performance tests on this setup?

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox, let's do it :)

After 19 hrs, it's still at epoch 3.

[root@bright88 mxnet]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               149      defq     bash     root  R   19:51:40      2 node[001-002]
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox How can I verify RDMA? I don't think I have installed any MLNX_OFED driver on the hosts. I am using ConnectX-5 with TCP, and I have not installed the packages below. The hosts are on Rocky Linux 8.6.

# yum -y groupinstall "InfiniBand Support"

# yum -y install perftest infiniband-diags

It is recommended to install the latest MLNX_OFED, however, it is possible to use the RDMA inbox drivers.

RDMA / RoCE with the ConnectX-5 GbE card is possible, but I guess I am just using TCP for communication. Please correct me if I'm wrong and let me know the next steps.

Artemy-Mellanox commented 1 year ago

@karanveersingh5623 could you please install the perftest package to test RDMA between the nodes? Try adding this to your script:

sleep $SLURM_NODEID && ib_send_bw $([ $SLURM_NODEID == 0 ] || echo node001)
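
The one-liner above relies on node ID 0 starting `ib_send_bw` as the server (no argument), while every other node sleeps briefly and then connects to node001 as a client. A more explicit sketch of the same logic follows; `perftest_cmd` is a hypothetical helper, and it assumes node001 is the node with `SLURM_NODEID` 0.

```shell
#!/bin/bash
# Sketch: print the perftest command a given node should run.
# perftest_cmd is a hypothetical helper; node ID 0 is assumed to be the server.
perftest_cmd() {
    local node_id="$1" server="$2"
    if [ "$node_id" -eq 0 ]; then
        echo "ib_send_bw"            # server side: waits for a client
    else
        echo "ib_send_bw $server"    # client side: connects to the server
    fi
}

# Inside the job script, something like:
#   sleep "$SLURM_NODEID"    # delay clients so the server is listening first
#   $(perftest_cmd "$SLURM_NODEID" node001)
```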
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox

Server

[root@node001 perftest]# ./ib_send_bw

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    Send BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x4f24 PSN 0x3c89e0
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:89
 remote address: LID 0000 QPN 0x1de0 PSN 0x67f45c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:90
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             0.00               6284.84            0.100557
---------------------------------------------------------------------------------------
ethernet_read_keys: Couldn't read remote address
 Unable to read to socket/rdma_cm
 Failed to close connection between server and client
 Trying to close this side resources

Client

[root@node002 perftest]# ./ib_send_bw -b node001
---------------------------------------------------------------------------------------
                    Send Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x1de0 PSN 0x67f45c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:90
 remote address: LID 0000 QPN 0x4f24 PSN 0x3c89e0
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:89
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]

 Did not get Message for 120 Seconds, exiting..
 Total Received=0, Total Iters Required=1000
Artemy-Mellanox commented 1 year ago

ib_send_bw shows good performance (the -b option is needed on the server side too, so that there is no error). What is the expected time to finish the cosmoflow benchmark with this dataset?

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox I ran with the -b option on the server side and I am getting 13 GB/s. That's bidirectional, so the max throughput per direction is around 5~6 GB/s, or 50 Gbps. That seems right, as my Mellanox card sits in an x8 PCIe slot, not x16; otherwise I would have got 100 Gbps.

[root@node004 perftest]# ./ib_send_bw -b node003
---------------------------------------------------------------------------------------
                    Send Bidirectional BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 RX depth        : 512
 CQ Moderation   : 1
 Mtu             : 1024[B]
 Link type       : Ethernet
 GID index       : 3
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x463b PSN 0x645ee4
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:92
 remote address: LID 0000 QPN 0x4d63 PSN 0x5a236c
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:91
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[MB/sec]    BW average[MB/sec]   MsgRate[Mpps]
 65536      1000             13188.14            13187.32                  0.210997
---------------------------------------------------------------------------------------
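The unit conversion behind the figures above can be checked with a short sketch. It assumes that perftest's "MB" means 2^20 bytes, which is consistent with the table itself (65536 bytes x 0.210997 Mpps is about 13187 x 2^20 bytes per second). The helper name `mib_to_gbps` is hypothetical and only serves the illustration.

```shell
#!/bin/sh
# Convert a perftest "BW average[MB/sec]" value into Gb/s.
# Assumption: 1 MB in the perftest report is 2^20 bytes (see lead-in).
mib_to_gbps() {
    awk -v mib="$1" 'BEGIN { printf "%.1f\n", mib * 1048576 * 8 / 1000000000 }'
}

total=$(mib_to_gbps 13187.32)    # bidirectional figure, both directions combined
per_dir=$(mib_to_gbps 6593.66)   # half of the bidirectional figure
echo "total: ${total} Gb/s, per direction: ${per_dir} Gb/s"
```

Under that assumption the reported 13187.32 MB/sec is about 110.6 Gb/s combined, or about 55.3 Gb/s per direction, which is close to the rough 50 Gbps estimate above.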
karanveersingh5623 commented 1 year ago

ib_send_bw shows good performance (the -b option is needed on the server side too, so that there is no error). What is the expected time to finish the cosmoflow benchmark with this dataset?

As I am running just 5 epochs, the maximum time to finish on a single node (multi-GPU) is 20~25 min.

Artemy-Mellanox commented 1 year ago

@karanveersingh5623 could you please download the OSU benchmarks and build them with CUDA support:

./configure CC=$OMPI_HOME/bin/mpicc CXX=$OMPI_HOME/bin/mpic++ --enable-cuda --with-cuda=$CUDA_HOME

and then run the mpi/pt2pt/osu_bw D D test between two nodes using mpirun?
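A minimal sketch of what such a two-node run could look like is shown below. The helper `osu_bw_cmd` is hypothetical and only prints the command line, so it can be reviewed before anything is launched. The host names are taken from this thread, and `--mca pml ucx` assumes the Open MPI installation was built with UCX support, which is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: print an mpirun command line for osu_bw with both
# buffers in GPU memory ("D D"), one rank on each of two hosts.
# Assumptions: Open MPI with UCX support, OSU benchmarks built in-tree.
osu_bw_cmd() {
    host_a="$1"
    host_b="$2"
    osu_dir="${3:-.}"
    echo mpirun -np 2 -H "${host_a},${host_b}" --mca pml ucx \
        "${osu_dir}/mpi/pt2pt/osu_bw" D D
}

# Dry run: show the command for the two nodes used in this thread.
osu_bw_cmd node001 node002
```

When the command is started as root, as in the shell prompts above, Open MPI normally also needs `--allow-run-as-root`.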

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox, configure is failing for the OSU benchmark. Please refer to the trace below and let me know if you have any pointers.

[root@node001 OSU_Microbenchmarks]# nvidia-smi
Tue Dec 13 16:55:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:17:00.0 Off |                    0 |
| N/A   33C    P0    41W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:65:00.0 Off |                    0 |
| N/A   34C    P0    42W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                    0 |
| N/A   34C    P0    45W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100 80G...  On   | 00000000:E3:00.0 Off |                    0 |
| N/A   35C    P0    44W / 300W |      0MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[root@node001 OSU_Microbenchmarks]# ls /cm/shared/apps/cuda11.7/toolkit/11.7.1/
bin  C  compat  compute-sanitizer  CUDA_Toolkit_Release_Notes.txt  DOCS  etc  EULA.txt  extras  gds  include  lib64  LICENSE  man  nvml  nvvm  README  share  src  targets  tools  version.json
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# ls /usr/lib64/openmpi/bin/
aggregate_profile.pl  mpicc  mpicxx   mpif77  mpifort  ompi-clean  ompi-server   ortecc      orted      orterun      oshc++  oshCC   oshfort      oshrun          shmemc++  shmemCC   shmemfort
mpic++                mpiCC  mpiexec  mpif90  mpirun   ompi_info   opal_wrapper  orte-clean  orte-info  orte-server  oshcc   oshcxx  oshmem_info  profile2mat.pl  shmemcc   shmemcxx  shmemrun
[root@node001 OSU_Microbenchmarks]# ./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... /usr/lib64/openmpi/bin/mpicc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by /usr/lib64/openmpi/bin/mpicc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 3458764513820540925
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from /usr/lib64/openmpi/bin/mpicc object... ok
checking how to run the C preprocessor... /usr/lib64/openmpi/bin/mpicc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if /usr/lib64/openmpi/bin/mpicc supports -fno-rtti -fno-exceptions... no
checking for /usr/lib64/openmpi/bin/mpicc option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpicc PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpicc static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpicc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether we are using the GNU C compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... (cached) yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... (cached) none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... (cached) gcc3
checking whether we are using the GNU C++ compiler... yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... gcc3
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... (cached) yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... (cached) gcc3
checking how to run the C++ preprocessor... /usr/lib64/openmpi/bin/mpic++ -E
checking for ld used by /usr/lib64/openmpi/bin/mpic++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for /usr/lib64/openmpi/bin/mpic++ option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpic++ PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpic++ static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for library containing sqrt... -lm
checking for library containing pthread_join... none required
checking for library containing clock_gettime... none required
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for inline... inline
checking for getpagesize... yes
checking for gettimeofday... yes
checking for memset... yes
checking for sqrt... yes
checking for MPI_Init... yes
checking for MPI_Accumulate... yes
checking for MPI_Get_accumulate... yes
checking for shmem_barrier_all... no
checking for upc_memput... no
checking whether upcxx_alltoall is declared... no
checking for library containing cuPointerGetAttribute... no
configure: error: cannot link with -lcuda
[root@node001 OSU_Microbenchmarks]#
Artemy-Mellanox commented 1 year ago

configure failed to link with CUDA. Could you please attach config.log so we can identify the reason?
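As a rough sketch, the relevant part of config.log can also be pulled out directly instead of attaching the whole file. The helper name `show_check` and the 12-line context window are assumptions for illustration; the default pattern is the check that fails in the trace above.

```shell
#!/bin/sh
# Hypothetical helper: print the part of an autoconf config.log that follows
# a given "checking for ..." line, where the failing compile/link command
# and the linker's error message are usually recorded.
show_check() {
    log="${1:-config.log}"
    pattern="${2:-checking for library containing cuPointerGetAttribute}"
    grep -A 12 -- "$pattern" "$log"
}

# Only run when a config.log is present in the current directory.
if [ -f config.log ]; then
    show_check config.log
fi
```

The lines after the check normally include the exact link command that configure tried, which shows the directories passed to the linker with -L.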

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox Please find attached config.log.

Artemy-Mellanox commented 1 year ago

You need to either install the nvidia-driver-latest-cuda-libs packages or add /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs to LD_LIBRARY_PATH.

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox Same result!

[root@node001 OSU_Microbenchmarks]# ll
total 1448
-rw-r--r-- 1 root root 316266 Dec 13 16:28 aclocal.m4
-rw-r--r-- 1 root root   9579 Dec 13 16:28 CHANGES
-rwxr-xr-x 1 root root  44941 Dec 13 16:28 config.guess
-rw-r--r-- 1 root root  51810 Dec 26 10:30 config.log
-rwxr-xr-x 1 root root  34423 Dec 13 16:28 config.sub
-rwxr-xr-x 1 root root 607857 Dec 13 16:28 configure
-rw-r--r-- 1 root root   6275 Dec 13 16:28 configure.ac
-rw-r--r-- 1 root root   2024 Dec 13 16:28 COPYRIGHT
-rwxr-xr-x 1 root root  18615 Dec 13 16:28 depcomp
-rwxr-xr-x 1 root root     66 Dec 13 16:28 get_local_rank
-rwxr-xr-x 1 root root  13663 Dec 13 16:28 install-sh
-rwxr-xr-x 1 root root 243248 Dec 13 16:28 ltmain.sh
-rw-r--r-- 1 root root    252 Dec 13 16:28 Makefile.am
-rw-r--r-- 1 root root  24933 Dec 13 16:28 Makefile.in
-rwxr-xr-x 1 root root  11419 Dec 13 16:28 missing
drwxr-xr-x 6 root root    135 Dec 13 16:28 mpi
drwxr-xr-x 2 root root   4096 Dec 13 16:28 openshmem
-rw-r--r-- 1 root root  46257 Dec 13 16:28 README
drwxr-xr-x 2 root root   4096 Dec 13 16:28 upc
drwxr-xr-x 2 root root   4096 Dec 13 16:28 upcxx
[root@node001 OSU_Microbenchmarks]# echo $LD_LIBRARY_PATH

[root@node001 OSU_Microbenchmarks]# ls /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
libcublasLt.so  libcuda.so   libcufftw.so  libcusolverMg.so  libcusparse.so  libnppial.so  libnppidei.so  libnppig.so  libnppist.so  libnppitc.so  libnvidia-ml.so  libnvrtc.so
libcublas.so    libcufft.so  libcurand.so  libcusolver.so    libnppc.so      libnppicc.so  libnppif.so    libnppim.so  libnppisu.so  libnpps.so    libnvjpeg.so
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# export LD_LIBRARY_PATH=/cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
[root@node001 OSU_Microbenchmarks]# echo $LD_LIBRARY_PATH
/cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
[root@node001 OSU_Microbenchmarks]# ./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... /usr/lib64/openmpi/bin/mpicc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by /usr/lib64/openmpi/bin/mpicc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 3458764513820540925
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from /usr/lib64/openmpi/bin/mpicc object... ok
checking how to run the C preprocessor... /usr/lib64/openmpi/bin/mpicc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if /usr/lib64/openmpi/bin/mpicc supports -fno-rtti -fno-exceptions... no
checking for /usr/lib64/openmpi/bin/mpicc option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpicc PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpicc static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpicc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether we are using the GNU C compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... (cached) yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... (cached) none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... (cached) gcc3
checking whether we are using the GNU C++ compiler... yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... gcc3
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... (cached) yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... (cached) gcc3
checking how to run the C++ preprocessor... /usr/lib64/openmpi/bin/mpic++ -E
checking for ld used by /usr/lib64/openmpi/bin/mpic++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for /usr/lib64/openmpi/bin/mpic++ option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpic++ PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpic++ static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for library containing sqrt... -lm
checking for library containing pthread_join... none required
checking for library containing clock_gettime... none required
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for inline... inline
checking for getpagesize... yes
checking for gettimeofday... yes
checking for memset... yes
checking for sqrt... yes
checking for MPI_Init... yes
checking for MPI_Accumulate... yes
checking for MPI_Get_accumulate... yes
checking for shmem_barrier_all... no
checking for upc_memput... no
checking whether upcxx_alltoall is declared... no
checking for library containing cuPointerGetAttribute... no
configure: error: cannot link with -lcuda
karanveersingh5623 commented 1 year ago

Attaching config.log.

Artemy-Mellanox commented 1 year ago

Could you please add the LDFLAGS=-Wl,--verbose option to ./configure and then attach config.log, like this:

./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ LDFLAGS=-Wl,--verbose --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox Please find attached config.log.

Artemy-Mellanox commented 1 year ago

Could you please add /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs to LIBRARY_PATH as well?
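The difference matters because GCC consults LIBRARY_PATH when it resolves -l options at link time, while LD_LIBRARY_PATH is read by the dynamic loader at run time. A minimal sketch of adding the stubs directory for the build is shown below; the helper name `add_to_library_path` is hypothetical, and the path is the one quoted in this thread.

```shell
#!/bin/sh
# Hypothetical helper: prepend a directory to LIBRARY_PATH (the link-time
# search path used by gcc for -l options) unless it is already listed.
add_to_library_path() {
    dir="$1"
    case ":${LIBRARY_PATH:-}:" in
        *":${dir}:"*) ;;   # already present, nothing to do
        *) LIBRARY_PATH="${dir}${LIBRARY_PATH:+:${LIBRARY_PATH}}" ;;
    esac
    export LIBRARY_PATH
}

# Stub libcuda.so shipped with the CUDA toolkit (path from this thread).
CUDA_STUBS=/cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
add_to_library_path "$CUDA_STUBS"
echo "LIBRARY_PATH=$LIBRARY_PATH"
```

Passing `LDFLAGS=-L` with the same directory to ./configure should have the same effect at link time. The stub libcuda.so is meant for linking only, so at run time the real driver library needs to be the one the loader finds.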

karanveersingh5623 commented 1 year ago

@Artemy-Mellanox, it looks like configure now goes through, but make is failing. I set the PATH variable because nvcc was not found.

[root@node001 OSU_Microbenchmarks]# ./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ LDFLAGS=-Wl,--verbose --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... /usr/lib64/openmpi/bin/mpicc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by /usr/lib64/openmpi/bin/mpicc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 3458764513820540925
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from /usr/lib64/openmpi/bin/mpicc object... ok
checking how to run the C preprocessor... /usr/lib64/openmpi/bin/mpicc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if /usr/lib64/openmpi/bin/mpicc supports -fno-rtti -fno-exceptions... no
checking for /usr/lib64/openmpi/bin/mpicc option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpicc PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpicc static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpicc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether we are using the GNU C compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... (cached) yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... (cached) none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... (cached) gcc3
checking whether we are using the GNU C++ compiler... yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... gcc3
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... (cached) yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... (cached) gcc3
checking how to run the C++ preprocessor... /usr/lib64/openmpi/bin/mpic++ -E
checking for ld used by /usr/lib64/openmpi/bin/mpic++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for /usr/lib64/openmpi/bin/mpic++ option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpic++ PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpic++ static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for library containing sqrt... -lm
checking for library containing pthread_join... none required
checking for library containing clock_gettime... none required
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for inline... inline
checking for getpagesize... yes
checking for gettimeofday... yes
checking for memset... yes
checking for sqrt... yes
checking for MPI_Init... yes
checking for MPI_Accumulate... yes
checking for MPI_Get_accumulate... yes
checking for shmem_barrier_all... no
checking for upc_memput... no
checking whether upcxx_alltoall is declared... no
checking for library containing cuPointerGetAttribute... -lcuda
checking for library containing cudaFree... -lcudart
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating mpi/Makefile
config.status: creating mpi/pt2pt/Makefile
config.status: creating mpi/startup/Makefile
config.status: creating mpi/one-sided/Makefile
config.status: creating mpi/collective/Makefile
config.status: creating openshmem/Makefile
config.status: creating upc/Makefile
config.status: creating upcxx/Makefile
config.status: executing depfiles commands
config.status: executing libtool commands

make clean && make

[root@node001 OSU_Microbenchmarks]# echo $PATH
/cm/shared/apps/cuda11.7/toolkit/11.7.1/bin:/usr/lib64/openmpi/bin:/cm/local/apps/environment-modules/4.5.3//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/cm/local/apps/environment-modules/4.5.3/bin:/opt/dell/srvadmin/bin:/opt/dell/srvadmin/sbin:/root/bin

attempt to open //usr/x86_64-redhat-linux/lib64/libevent_core-2.1.so.6 failed
found libevent_core-2.1.so.6 at //usr/lib64/libevent_core-2.1.so.6
libevent_pthreads-2.1.so.6 needed by /usr/lib64/openmpi/lib/libmpi.so
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/openmpi/lib/libevent_pthreads-2.1.so.6 failed
attempt to open /usr/lib64/openmpi/lib/libevent_pthreads-2.1.so.6 failed
attempt to open /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/openmpi/lib/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/atlas/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64//bind9-export/libevent_pthreads-2.1.so.6 failed
attempt to open //cm/local/apps/cuda/libs/current/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/dyninst/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-idrac7/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-isvc/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/smpop/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/x86_64-redhat-linux/lib64/libevent_pthreads-2.1.so.6 failed
found libevent_pthreads-2.1.so.6 at //usr/lib64/libevent_pthreads-2.1.so.6
libcrypto.so.1.1 needed by //usr/lib64/libevent_core-2.1.so.6
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/libcrypto.so.1.1 failed
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib/libcrypto.so.1.1 failed
attempt to open //usr/lib64/openmpi/lib/libcrypto.so.1.1 failed
attempt to open /usr/lib64/openmpi/lib/libcrypto.so.1.1 failed
attempt to open /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs/libcrypto.so.1.1 failed
attempt to open //usr/lib64/atlas/libcrypto.so.1.1 failed
attempt to open //usr/lib64//bind9-export/libcrypto.so.1.1 failed
attempt to open //cm/local/apps/cuda/libs/current/lib64/libcrypto.so.1.1 failed
attempt to open //usr/lib64/dyninst/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-idrac7/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-isvc/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/smpop/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //usr/x86_64-redhat-linux/lib64/libcrypto.so.1.1 failed
found libcrypto.so.1.1 at //usr/lib64/libcrypto.so.1.1
make[2]: Leaving directory '/cm/shared/OSU_Microbenchmarks/mpi/one-sided'
make[2]: Entering directory '/cm/shared/OSU_Microbenchmarks/mpi'
make[2]: Nothing to be done for 'all-am'.
make[2]: Leaving directory '/cm/shared/OSU_Microbenchmarks/mpi'
make[1]: Leaving directory '/cm/shared/OSU_Microbenchmarks/mpi'
make[1]: Entering directory '/cm/shared/OSU_Microbenchmarks'
make[1]: Nothing to be done for 'all-am'.
make[1]: Leaving directory '/cm/shared/OSU_Microbenchmarks'
[root@node001 OSU_Microbenchmarks]# ll
total 2628
-rw-r--r-- 1 root root 316266 Dec 30 16:11 aclocal.m4
drwxr-xr-x 2 root root     70 Dec 30 16:11 autom4te.cache
-rw-r--r-- 1 root root   9579 Dec 13 16:28 CHANGES
-rwxr-xr-x 1 root root  44941 Dec 13 16:28 config.guess
-rw-r--r-- 1 root root 951459 Dec 30 16:15 config.log
-rwxr-xr-x 1 root root  67674 Dec 30 16:15 config.status
-rwxr-xr-x 1 root root  34423 Dec 13 16:28 config.sub
-rwxr-xr-x 1 root root 550027 Dec 30 16:11 configure
-rw-r--r-- 1 root root   6275 Dec 13 16:28 configure.ac
-rw-r--r-- 1 root root   2024 Dec 13 16:28 COPYRIGHT
-rwxr-xr-x 1 root root  18615 Dec 13 16:28 depcomp
-rwxr-xr-x 1 root root     66 Dec 13 16:28 get_local_rank
-rwxr-xr-x 1 root root  13663 Dec 13 16:28 install-sh
-rwxr-xr-x 1 root root 264446 Dec 30 16:15 libtool
-rwxr-xr-x 1 root root 243248 Dec 13 16:28 ltmain.sh
-rw-r--r-- 1 root root  26427 Dec 30 16:15 Makefile
-rw-r--r-- 1 root root    252 Dec 13 16:28 Makefile.am
-rw-r--r-- 1 root root  24933 Dec 30 16:11 Makefile.in
-rwxr-xr-x 1 root root  11419 Dec 13 16:28 missing
drwxr-xr-x 6 root root    155 Dec 30 16:15 mpi
drwxr-xr-x 3 root root   4096 Dec 30 16:15 openshmem
-rw-r--r-- 1 root root  46257 Dec 13 16:28 README
drwxr-xr-x 3 root root   4096 Dec 30 16:15 upc
drwxr-xr-x 3 root root   4096 Dec 30 16:15 upcxx
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox I tried running the benchmark:

[root@bright88 pt2pt]# pwd
/cm/shared/OSU_Microbenchmarks/mpi/pt2pt
[root@bright88 pt2pt]#
[root@bright88 pt2pt]#
[root@bright88 pt2pt]# ll
total 844
-rw-r--r-- 1 root root    16 Dec 30 17:15 hostfile
-rw-r--r-- 1 root root 19453 Dec 30 17:02 Makefile
-rw-r--r-- 1 root root   784 Dec 13 16:28 Makefile.am
-rw-r--r-- 1 root root 18817 Dec 30 16:11 Makefile.in
-rwxr-xr-x 1 root root 68584 Dec 30 16:51 osu_bibw
-rw-r--r-- 1 root root  4528 Dec 13 16:28 osu_bibw.c
-rw-r--r-- 1 root root 30032 Dec 30 16:51 osu_bibw.o
-rwxr-xr-x 1 root root 68584 Dec 30 16:51 osu_bw
-rw-r--r-- 1 root root  4111 Dec 13 16:28 osu_bw.c
-rw-r--r-- 1 root root 29640 Dec 30 16:51 osu_bw.o
-rwxr-xr-x 1 root root 68032 Dec 30 16:51 osu_latency
-rw-r--r-- 1 root root  3705 Dec 13 16:28 osu_latency.c
-rwxr-xr-x 1 root root 78936 Dec 30 16:51 osu_latency_mt
-rw-r--r-- 1 root root  6879 Dec 13 16:28 osu_latency_mt.c
-rw-r--r-- 1 root root 51384 Dec 30 16:51 osu_latency_mt.o
-rw-r--r-- 1 root root 28000 Dec 30 16:51 osu_latency.o
-rwxr-xr-x 1 root root 51136 Dec 30 16:51 osu_mbw_mr
-rw-r--r-- 1 root root 10684 Dec 13 16:28 osu_mbw_mr.c
-rw-r--r-- 1 root root 61392 Dec 30 16:51 osu_mbw_mr.o
-rwxr-xr-x 1 root root 70944 Dec 30 16:51 osu_multi_lat
-rw-r--r-- 1 root root  4757 Dec 13 16:28 osu_multi_lat.c
-rw-r--r-- 1 root root 43080 Dec 30 16:51 osu_multi_lat.o
-rw-r--r-- 1 root root 17041 Dec 13 16:28 osu_pt2pt.c
-rw-r--r-- 1 root root  2779 Dec 13 16:28 osu_pt2pt.h
-rw-r--r-- 1 root root 67336 Dec 30 16:51 osu_pt2pt.o
[root@bright88 pt2pt]#
[root@bright88 pt2pt]#
[root@bright88 pt2pt]#
[root@bright88 pt2pt]# mpirun -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank ./osu_latency D D
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:

  my node:   bright88
  target node:  node002

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
karanveersingh5623 commented 1 year ago

@Artemy-Mellanox I managed to get further by calling mpirun with its full path:

[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun -np 2 -hostfile hostfile ./osu_bw D D
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           node001
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       rdmacm, udcm
--------------------------------------------------------------------------
# OSU MPI-CUDA Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size      Bandwidth (MB/s)
[node001:1935046] *** Process received signal ***
[node001:1935046] Signal: Segmentation fault (11)
[node001:1935046] Signal code: Invalid permissions (2)
[node001:1935046] Failing at address: 0x155523200000
[node001:1935046] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x155553341cf0]
[node001:1935046] [ 1] /lib64/libc.so.6(+0xd003c)[0x15555303903c]
[node001:1935046] [ 2] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/libopen-pal.so.40(opal_convertor_pack+0x1a8)[0x155552633808]
[node001:1935046] [ 3] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/openmpi/mca_btl_vader.so(mca_btl_vader_sendi+0x11c)[0x15550c4b3d6c]
[node001:1935046] [ 4] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/openmpi/mca_pml_ob1.so(+0xade4)[0x155506312de4]
[node001:1935046] [ 5] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4ff)[0x155506313a8f]
[node001:1935046] [ 6] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/libmpi.so.40(MPI_Isend+0x125)[0x1555535d3045]
[node001:1935046] [ 7] ./osu_bw[0x40128b]
[node001:1935046] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x155552fa3d85]
[node001:1935046] [ 9] ./osu_bw[0x4015ee]
[node001:1935046] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1935046 on node node001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[bright88:3782908] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[bright88:3782908] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[root@bright88 pt2pt]#
[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun -np 2 -hostfile hostfile MV2_USE_CUDA=1 ./osu_bw D H
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       node001
Executable: MV2_USE_CUDA=1
--------------------------------------------------------------------------
2 total processes failed to start
[root@bright88 pt2pt]#
yosefe commented 1 year ago

@karanveersingh5623 can you please add "-mca pml ucx -mca btl self --report-bindings" to the mpirun command?

karanveersingh5623 commented 1 year ago

@yosefe @Artemy-Mellanox Below is the output:

[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun --mca pml ucx --mca btl self --report-bindings -np 2 -hostfile hostfile MV2_USE_CUDA=1 ./osu_bw D H
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       node001
Executable: MV2_USE_CUDA=1
--------------------------------------------------------------------------
2 total processes failed to start
[root@bright88 pt2pt]#
yosefe commented 1 year ago

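Please remove 'MV2_USE_CUDA=1' from the command line. As the error output shows, mpirun treats the first unrecognized token as the executable, so it tries to launch "MV2_USE_CUDA=1" instead of ./osu_bw.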
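
For reference, a minimal sketch of the command with that token dropped. The mpirun path, MCA options, hostfile and benchmark arguments are copied from the commands earlier in this thread; the `RUN` switch is a hypothetical guard added only for this sketch, so that the script prints the command by default and launches it only when asked.

```shell
#!/bin/sh
# mpirun path as used earlier in this thread
MPIRUN=/cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun

# Build the argument list with no VAR=value token before the executable,
# so that ./osu_bw is the first token mpirun does not recognize as an option.
set -- "$MPIRUN" --mca pml ucx --mca btl self --report-bindings \
    -np 2 -hostfile hostfile ./osu_bw D H

# Print the command for review
CMD="$*"
echo "$CMD"

# RUN=1 is a hypothetical switch for this sketch: launch only when requested
if [ "${RUN:-0}" = 1 ]; then
    "$@"
fi
```

If an environment variable does have to reach the ranks, Open MPI's mpirun takes it through `-x NAME=value` placed before the executable. MV2_USE_CUDA is an MVAPICH2 setting, so it probably has no effect with this Open MPI build anyway.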