Open karanveersingh5623 opened 2 years ago
@Artemy-Mellanox, if you can share the steps to fix this issue in the Docker container, it would be a great help.
@karanveersingh5623 can you pls add the following env vars to init_datasets.sh and post the output:
export UCX_LOG_LEVEL=info
export UCX_IB_MLX5_DEVX=no
@yosefe, thanks for getting back to me. Please see the output below:
# ENROOT_ALLOW_HTTP=yes srun --mpi=pmix_v3 -N 1 -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh
[node001:113245] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:113246] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:113244] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:113243] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
2022-08-17 09:56:06.381232: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:06.384032: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:06.384105: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:06.389060: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-17 09:56:07.581630: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-08-17 09:56:07.590577: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2022-08-17 09:56:07.743845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 3, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:e3:00.0, compute capability: 8.0
2022-08-17 09:56:07.752427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 2, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2022-08-17 09:56:08.287462: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.295460: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.383547: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.400239: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.423831: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.453386: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.461215: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.484786: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.504828: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.533111: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.553933: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.582050: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.584392: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.620111: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.635133: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.659646: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.676282: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.693794: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.720174: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.728267: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.776139: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.776139: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.802449: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.815474: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.863527: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.881359: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.881356: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.906286: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.960325: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.978037: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:08.991543: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-17 09:56:09.040572: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
[1660697766.019921] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026438] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030515] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049244] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069389] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105421] [node001:113244:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105529] [node001:113244:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108265] [node001:113244:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.123882] [node001:113244:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.128259] [node001:113244:async] ucp_worker.c:1956 UCX INFO ep_cfg[3]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203819] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209559] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213760] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233378] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253477] [node001:113244:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296460] [node001:113244:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300908] [node001:113244:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.317866] [node001:113244:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660697766.019923] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026411] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030528] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049067] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069472] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105457] [node001:113245:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105559] [node001:113245:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108321] [node001:113245:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.123627] [node001:113245:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203797] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209569] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213757] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233370] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253381] [node001:113245:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296388] [node001:113245:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300893] [node001:113245:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.319414] [node001:113245:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660697766.019920] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026445] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030500] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049696] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069474] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105492] [node001:113243:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105596] [node001:113243:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108324] [node001:113243:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.124026] [node001:113243:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203793] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209544] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213774] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233370] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253432] [node001:113243:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296362] [node001:113243:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300899] [node001:113243:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.320338] [node001:113243:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660697766.020258] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.026422] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.030521] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.049056] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.069449] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.105392] [node001:113246:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.105503] [node001:113246:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660697766.108265] [node001:113246:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660697766.123514] [node001:113246:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660697766.203795] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.209549] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.213748] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.233357] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.253418] [node001:113246:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660697766.296425] [node001:113246:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660697766.300915] [node001:113246:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660697766.320306] [node001:113246:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[node001:114865] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:114865] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:114865] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
@karanveersingh5623 does it mean the job managed to run with these parameters?
@yosefe, the first job (the training dataset) managed to run, but when it moves on to the validation dataset, it fails. The parameters you mentioned only added extra trace lines.
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip
@karanveersingh5623 Is that other failure related to UCX? I don't see any more UCX-related errors in the output.
Is the error below not UCX-related?
[node001:63628] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[1660021420.383880] [node001:63628:0] rc_mlx5_devx.c:99 UCX ERROR mlx5dv_devx_create_event_channel() failed: Protocol not supported
[node001:63628] pml_ucx.c:309 Error: Failed to create UCP worker
I don't see it in https://github.com/openucx/ucx/issues/8440#issuecomment-1217329624. Does it still happen after adding export UCX_IB_MLX5_DEVX=no?
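One way to check this, as a minimal sketch: save the job output to a file and keep only the lines logged at the UCX ERROR level, so the DIAG and INFO lines do not hide them. The helper name ucx_error_lines and the file name job.log are hypothetical and not part of UCX.

```shell
# Hypothetical helper (not part of UCX): print the UCX ERROR lines of a saved job log.
# UCX tags each message with its level (DIAG, INFO, ERROR, ...), so a plain
# pattern match on the level is enough. Returns non-zero when nothing matches.
ucx_error_lines() {
    grep -E 'UCX +ERROR' "$1"
}

# Example (job.log is an assumed file name):
#   srun ... bash /mnt/mxnet/tools/init_datasets.sh > job.log 2>&1
#   ucx_error_lines job.log || echo "no UCX ERROR lines in job.log"
```

If this prints nothing with UCX_IB_MLX5_DEVX=no set, the remaining PMIX and OPAL messages are likely coming from outside UCX.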
@yosefe, oh OK, yes, those errors are gone now, but there is still an issue with communication within the job. I don't know what is causing the failure messages. If you could point me in some direction, it would be helpful.
@karanveersingh5623 could you please run ucx_info -v and post the output, so we can be sure we're not missing something?
@Artemy-Mellanox, below is the shell script I ran within the Docker container:
DATA_SRC_DIR="/mnt/cosmoUniverse_2019_05_4parE_tf_small"
DATA_DST_DIR="/mnt/processed"
export UCX_LOG_LEVEL=info
export UCX_IB_MLX5_DEVX=no
ucx_info -v
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip
python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip
ls -1 ${DATA_DST_DIR}/train | grep "_data.npy" | sort > ${DATA_DST_DIR}/train/files_data.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_data.npy" | sort > ${DATA_DST_DIR}/validation/files_data.lst
ls -1 ${DATA_DST_DIR}/train | grep "_label.npy" | sort > ${DATA_DST_DIR}/train/files_label.lst
ls -1 ${DATA_DST_DIR}/validation | grep "_label.npy" | sort > ${DATA_DST_DIR}/validation/files_label.lst
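Because the script runs the two conversion steps back to back, it can help to make each step report its own exit status, so the log shows exactly where validation stops. The following is only a sketch: run_step is a hypothetical wrapper, not something from the original script.

```shell
# Hypothetical wrapper: run one command, print its exit status, and return it.
# The if/else form keeps working even when the calling script uses `set -e`.
run_step() {
    step_name="$1"
    shift
    echo "=== ${step_name}: starting"
    if "$@"; then
        step_rc=0
    else
        step_rc=$?
    fi
    echo "=== ${step_name}: exit status ${step_rc}"
    return "${step_rc}"
}

# Possible use inside init_datasets.sh (same commands as in the script above):
#   run_step train python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py \
#       -i ${DATA_SRC_DIR}/train -o ${DATA_DST_DIR}/train -c gzip || exit 1
#   run_step validation python3 /mnt/mxnet/tools/convert_tfrecord_to_numpy.py \
#       -i ${DATA_SRC_DIR}/validation -o ${DATA_DST_DIR}/validation -c gzip || exit 1
```

With four tasks launched by srun, each task prints its own markers, so the failing task and step can be matched against the PMIX and OPAL messages.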
Below is the trace generated. Train finishes without issues; validation fails. There are just two python commands running in a shell script, and I am using a single host.
[root@bright88 burst-buffer]# ENROOT_ALLOW_HTTP=yes srun --mpi=pmix_v3 -N 1 -G 4 --ntasks=4 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=cosmoflow-preprocess --container-workdir=/mnt/ --container-mounts=/mnt/lustre:/mnt bash /mnt/mxnet/tools/init_datasets.sh
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
[node001:41715] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:41717] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:41718] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:41716] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
2022-08-19 10:30:56.823437: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:56.823818: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:56.823995: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:56.824052: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-08-19 10:30:58.028348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 3, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:e3:00.0, compute capability: 8.0
2022-08-19 10:30:58.029114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 0, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:17:00.0, compute capability: 8.0
2022-08-19 10:30:58.029785: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 2, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:ca:00.0, compute capability: 8.0
2022-08-19 10:30:58.265219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 79135 MB memory: -> device: 1, name: NVIDIA A100 80GB PCIe, pci bus id: 0000:65:00.0, compute capability: 8.0
2022-08-19 10:30:58.862638: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.877999: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.884026: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.942857: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.958647: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.973721: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:58.979046: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.027725: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.033068: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.073127: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.074661: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.112304: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.120149: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.171104: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.171103: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.196387: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.211329: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.251460: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.265213: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.279228: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.293308: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.326552: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.342905: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.359793: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.374258: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.401641: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.419742: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.438224: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.454582: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.480291: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.496656: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
2022-08-19 10:30:59.516348: E tensorflow/core/lib/io/record_reader.cc:50] Unsupported compression_type:gzip. No compression will be used.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
Found 8 files, 0 are done, 8 are remaining.
[1660872656.478887] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.482134] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.485205] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.500491] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.509412] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.529735] [node001:41718:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.529846] [node001:41718:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556293] [node001:41718:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.572946] [node001:41718:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649737] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655384] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660118] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679927] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698743] [node001:41718:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738389] [node001:41718:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739800] [node001:41718:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.758164] [node001:41718:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.478888] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.482136] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.485145] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.500625] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.509407] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.529695] [node001:41715:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.529804] [node001:41715:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556233] [node001:41715:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.570917] [node001:41715:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649765] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655468] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660160] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679731] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698094] [node001:41715:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738323] [node001:41715:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739786] [node001:41715:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.758145] [node001:41715:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.504776] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.508595] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.511891] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.528701] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.537382] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.554993] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.555091] [node001:41716:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556214] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.570769] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649756] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655472] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660162] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679597] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698706] [node001:41716:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738430] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739808] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.759937] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.762306] [node001:41716:0] ucp_worker.c:1956 UCX INFO ep_cfg[3]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.504772] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.508626] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.513745] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.528710] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.537378] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.554950] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.555051] [node001:41717:0] parser.c:1893 UCX INFO UCX_* env variables: UCX_LOG_LEVEL=info UCX_IB_MLX5_DEVX=no
[1660872656.556322] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda);
[1660872656.573022] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.575952] [node001:41717:async] ucp_worker.c:1956 UCX INFO ep_cfg[3]: tag(posix/memory cma/memory rc_mlx5/mlx5_1:1 cuda_ipc/cuda);
[1660872656.649745] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.655512] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.660165] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.679618] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.698098] [node001:41717:0] sock.c:128 UCX DIAG failed to read from /sys/class/net/eth6/bonding/ad_num_ports: No such file or directory, assuming 802.3ad bonding is disabled
[1660872656.738355] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[0]: tag(cuda_copy/cuda); rma(cuda_copy/cuda);
[1660872656.739798] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[1]: tag(self/memory0 cma/memory rc_mlx5/mlx5_1:1 cuda_copy/cuda); rma(self/memory0 posix/memory sysv/memory);
[1660872656.759921] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[2]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[1660872656.762297] [node001:41717:0] ucp_worker.c:1956 UCX INFO ep_cfg[3]: tag(rc_mlx5/mlx5_1:1 posix/memory cma/memory cuda_ipc/cuda); rma(rc_mlx5/mlx5_1:1 posix/memory sysv/memory);
[node001:43337] PMIX ERROR: ERROR in file gds_ds12_lock_pthread.c at line 168
[node001:43337] PMIX ERROR: NOT-FOUND in file ptl_usock.c at line 175
[node001:43337] OPAL ERROR: Unreachable in file pmix3x_client.c at line 112
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
This issue was fixed in UCX version 1.12.x, so you can either upgrade UCX or set export UCX_IB_MLX5_DEVX=no,
which has the same effect.
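A minimal sketch of that choice as a script, assuming `ucx_info -v` prints a line of the form `# UCT version=1.11.0 revision ...` (as the UCX 1.11 build in this thread does) and that GNU `sort -V` is available. The helper names `parse_ucx_version`, `version_lt` and `apply_devx_workaround` are made up for illustration; the 1.12.0 threshold is taken from the comment above.

```shell
#!/usr/bin/env bash

# Extract "1.11.0" from a line such as "# UCT version=1.11.0 revision 6031c98".
# Prints nothing if no such line is found (other UCX releases may format this differently).
parse_ucx_version() {
    sed -n 's/^# *UCT version=\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Succeed if version $1 is strictly older than version $2 (relies on GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Export the workaround only when a version was detected and it predates 1.12.0.
apply_devx_workaround() {
    local ver="$1"
    if [ -n "$ver" ] && version_lt "$ver" "1.12.0"; then
        export UCX_IB_MLX5_DEVX=no
    fi
}

# Only query UCX when the tool is actually installed in this environment.
if command -v ucx_info >/dev/null 2>&1; then
    apply_devx_workaround "$(ucx_info -v | parse_ucx_version)"
fi
```

The variable has to be set in the environment the job actually runs in, for example by sourcing this at the top of the script passed to srun.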
@Artemy-Mellanox @yosefe
Below is the issue I am facing; please check the trace:
[1667471376.875185] [node002:217310:0] select.c:513 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=2 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-03 07:29:27 PM
running benchmark
STARTING TIMING RUN AT 2022-11-03 07:29:27 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=16-31,80-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=0-15,64-79 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-03 07:29:35 PM
running benchmark
STARTING TIMING RUN AT 2022-11-03 07:29:35 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=22
num_sockets = 2 num_nodes=2 cores_per_socket=22
+ exec numactl --physcpubind=22-43,66-87 --membind=1 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
+ exec numactl --physcpubind=0-21,44-65 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
[1667471376.843507] [node002:217310:0] ucp_context.c:780 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'eth4'(tcp), 'lo'(tcp)
[1667471376.868844] [node002:217310:0] parser.c:1885 UCX WARN unused env variable: UCX_IB_MLX5_DEVX (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1667471376.875185] [node002:217310:0] select.c:513 UCX ERROR no active messages transport to <no debug data>: Unsupported operation
[node002:217310] pml_ucx.c:419 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[node002:217310] pml_ucx.c:472 Error: Failed to resolve UCX endpoint for rank 0
Error in MPI_Isend(52788624, 1, 0x1554426edce0, 0, -27, 23451636022496) (-1)
Error in NBC_Start_round() (-1)
Error in NBC_Start_round() (-1)
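The UCX WARN line in the trace above names the devices UCX could actually see inside the container ('eth4' and 'lo', both tcp). A minimal sketch for pulling that list out of a saved log, assuming the warning keeps the wording shown in this trace; the function name `ucx_suggested_devices` is hypothetical.

```shell
#!/usr/bin/env bash

# Read a UCX log on stdin and print each device/transport pair that UCX
# suggests in its "is not available, please use one or more of:" warning,
# one per line, e.g. "eth4(tcp)".
ucx_suggested_devices() {
    grep 'is not available, please use one or more of:' |
        grep -o "'[^']*'([a-z0-9_]*)" |
        tr -d "'"
}
```

Usage would be along the lines of `ucx_suggested_devices < job.log`, where `job.log` stands for wherever the srun output was saved.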
@yosefe @Artemy-Mellanox Below is another trace, this time using just 1 task per node.
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-04 10:39:09 AM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
Segmentation fault: 11
terminate called after throwing an instance of 'std::system_error'
what(): Resource deadlock avoided
[node002:276417] Process received signal
[node002:276417] Signal: Aborted (6)
[node002:276417] Signal code: (-6)
[node002:276417] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x15555536a210]
[node002:276417] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x15555536a18b]
[node002:276417] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x155555349859]
[node002:276417] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x15550c6d0911]
[node002:276417] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x15550c6dc38c]
[node002:276417] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x15550c6db369]
[node002:276417] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(gxx_personality_v0+0x2a1)[0x15550c6dbd21]
[node002:276417] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x155554e6dbef]
[node002:276417] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x331)[0x155554e6e281]
[node002:276417] [ 9] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3c)[0x15550c6dc69c]
[node002:276417] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20throw_system_errori+0x98)[0x15550c6d373f]
[node002:276417] [11] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt6thread6detachEv+0x0)[0x15550c709060]
[node002:276417] [12] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common18HorovodGlobalStateD1Ev+0xb58)[0x1554427a97d8]
[node002:276417] [13] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x49a27)[0x15555536da27]
[node002:276417] [14] /usr/lib/x86_64-linux-gnu/libc.so.6(on_exit+0x0)[0x15555536dbe0]
[node002:276417] [15] /usr/local/lib/libmxnet.so(+0x17ee46f)[0x1554d4a7b46f]
[node002:276417] [16] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0)[0x1555553163c0]
[node002:276417] [17] /opt/hpcx/ompi/lib/libmpi.so.40(ompi_comm_split_type+0xb1)[0x15544262de31]
[node002:276417] [18] /opt/hpcx/ompi/lib/libmpi.so.40(PMPI_Comm_split_type+0x2e)[0x15544266442e]
[node002:276417] [19] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(_ZN7horovod6common10MPIContext10InitializeERKSt6vectorIiSaIiEERNS0_17MPIContextManagerE+0x17c)[0x1554427ecedc]
[node002:276417] [20] /usr/local/lib/python3.8/dist-packages/horovod/mxnet/mpi_lib.cpython-38-x86_64-linux-gnu.so(+0x9d9ec)[0x1554427a39ec]
[node002:276417] [21] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xd6de4)[0x15550c708de4]
[node002:276417] [22] /usr/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609)[0x15555530a609]
[node002:276417] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x43)[0x155555446293]
[node002:276417] End of error message
./run_and_time.sh: line 211: 276417 Aborted (core dumped) ${LOGGER:-} ${DISTRIBUTED} ${BIND} python train.py "${PARAMS[@]}"
slurmstepd: error: mpi/pmix_v3: _errhandler: node002 [1]: pmixp_client_v2.c:211: Error handler invoked: status = -25, source = [slurm.pmix.46.0:1]
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: STEP 46.0 ON node001 CANCELLED AT 2022-11-04T10:39:25
srun: error: node002: task 1: Killed
srun: error: node001: task 0: Killed
Could you please run ucx_info -bdvc and ofed_info inside the container and attach the output here?
Run them in the container like this:
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ucx_info -bdvc
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ofed_info
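Since the two commands above differ only in the final tool name, they could also be driven from one small wrapper. This is just a sketch: the srun options are copied verbatim from the commands above, while the `run_diag` helper, the `SRUN` override variable and the `<tool>.log` output names are assumptions made for illustration.

```shell
#!/usr/bin/env bash

# srun options shared by both diagnostic runs (copied from the commands above).
SRUN_ARGS=(
    --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll"
    --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none
    --ntasks-per-node=1 --exclusive
    --container-image="192.168.61.4:5000#/cosmoflow-nvidia:0.4"
    --container-name=mlperf-hpc-cosmoflow
    --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area
)

# Hypothetical helper: run one tool (plus its arguments) inside the container
# and capture stdout and stderr in <tool>.log, ready to attach to the issue.
# SRUN can be overridden (for example with "echo") to do a dry run.
run_diag() {
    local out="$1.log"
    "${SRUN:-srun}" "${SRUN_ARGS[@]}" "$@" > "$out" 2>&1
}

# Only launch the jobs when srun is actually available on this machine.
if command -v srun >/dev/null 2>&1; then
    run_diag ucx_info -bdvc
    run_diag ofed_info
fi
```

This leaves ucx_info.log and ofed_info.log in the current directory.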
@Artemy-Mellanox
Please see the output below:
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ucx_info -bdvc
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY 1
#define ENABLE_DEBUG_DATA 0
#define ENABLE_MT 1
#define ENABLE_PARAMS_CHECK 0
#define HAVE_1_ARG_BFD_SECTION_SIZE 0
#define HAVE_ALLOCA 1
#define HAVE_ALLOCA_H 1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV 1
#define HAVE_CPU_SET_T 1
#define HAVE_CUDA 1
#define HAVE_CUDA_H 1
#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_DC_DV 1
#define HAVE_DECL_ASPRINTF 1
#define HAVE_DECL_BASENAME 1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 0
#define HAVE_DECL_BFD_SECTION_VMA 0
#define HAVE_DECL_CPU_ISSET 1
#define HAVE_DECL_CPU_ZERO 1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN 1
#define HAVE_DECL_FUSE_MOUNT 0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT 0
#define HAVE_DECL_F_SETOWN_EX 1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 1
#define HAVE_DECL_IBV_ADVISE_MR 1
#define HAVE_DECL_IBV_ALLOC_DM 1
#define HAVE_DECL_IBV_ALLOC_TD 1
#define HAVE_DECL_IBV_CMD_MODIFY_QP 0
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ 1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 0
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 0
#define HAVE_DECL_IBV_EXP_ALLOC_DM 0
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 0
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 0
#define HAVE_DECL_IBV_EXP_CREATE_QP 0
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 0
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 0
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 0
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 0
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 0
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 0
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_EXP_POST_SEND 0
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 0
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 0
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 0
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 0
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 0
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 0
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 0
#define HAVE_DECL_IBV_EXP_REG_MR 0
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 0
#define HAVE_DECL_IBV_EXP_SETENV 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 0
#define HAVE_DECL_IBV_EXP_WR_NOP 0
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 1
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID 1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT 1
#define HAVE_DECL_IN_ATTRIB 1
#define HAVE_DECL_IPPROTO_TCP 1
#define HAVE_DECL_MADV_FREE 1
#define HAVE_DECL_MADV_REMOVE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 1
#define HAVE_DECL_MLX5DV_CREATE_QP 1
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 1
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 1
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 1
#define HAVE_DECL_MLX5DV_OBJ_AH 1
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_BF 0
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_NC 0
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER 1
#define HAVE_DECL_SOL_SOCKET 1
#define HAVE_DECL_SO_KEEPALIVE 1
#define HAVE_DECL_SPEED_UNKNOWN 1
#define HAVE_DECL_STRERROR_R 1
#define HAVE_DECL_SYS_BRK 1
#define HAVE_DECL_SYS_IPC 0
#define HAVE_DECL_SYS_MADVISE 1
#define HAVE_DECL_SYS_MMAP 1
#define HAVE_DECL_SYS_MREMAP 1
#define HAVE_DECL_SYS_MUNMAP 1
#define HAVE_DECL_SYS_SHMAT 1
#define HAVE_DECL_SYS_SHMDT 1
#define HAVE_DECL_TCP_KEEPCNT 1
#define HAVE_DECL_TCP_KEEPIDLE 1
#define HAVE_DECL_TCP_KEEPINTVL 1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DEVX 1
#define HAVE_DLFCN_H 1
#define HAVE_GDRAPI_H 1
#define HAVE_HW_TIMER 1
#define HAVE_IB 1
#define HAVE_IBV_DM 1
#define HAVE_IN6_ADDR_S6_ADDR32 1
#define HAVE_INFINIBAND_MLX5DV_H 1
#define HAVE_INFINIBAND_TM_TYPES_H 1
#define HAVE_INOTIFY 1
#define HAVE_INTTYPES_H 1
#define HAVE_IP_IP_DST 1
#define HAVE_LIBGEN_H 1
#define HAVE_LIBRT 1
#define HAVE_LINUX_FUTEX_H 1
#define HAVE_LINUX_IP_H 1
#define HAVE_LINUX_MMAN_H 1
#define HAVE_MALLOC_H 1
#define HAVE_MALLOC_HOOK 1
#define HAVE_MALLOC_TRIM 1
#define HAVE_MEMALIGN 1
#define HAVE_MEMORY_H 1
#define HAVE_MLX5_HW 1
#define HAVE_MLX5_HW_UD 1
#define HAVE_MREMAP 1
#define HAVE_NETINET_IP_H 1
#define HAVE_NET_ETHERNET_H 1
#define HAVE_NUMA 1
#define HAVE_NUMAIF_H 1
#define HAVE_NUMA_H 1
#define HAVE_ODP 1
#define HAVE_ODP_IMPLICIT 1
#define HAVE_POSIX_MEMALIGN 1
#define HAVE_PREFETCH 1
#define HAVE_SCHED_GETAFFINITY 1
#define HAVE_SCHED_SETAFFINITY 1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRERROR_R 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRUCT_BITMASK 1
#define HAVE_STRUCT_DL_PHDR_INFO 1
#define HAVE_STRUCT_IBV_DEVICE_ATTR_EX_PCI_ATOMIC_CAPS 1
#define HAVE_STRUCT_IBV_TM_CAPS_FLAGS 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_SYS_EPOLL_H 1
#define HAVE_SYS_EVENTFD_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_SYS_UIO_H 1
#define HAVE_TL_DC 1
#define HAVE_TL_RC 1
#define HAVE_TL_UD 1
#define HAVE_UCM_PTMALLOC286 1
#define HAVE_UNISTD_H 1
#define HAVE___CLEAR_CACHE 1
#define HAVE___CURBRK 1
#define HAVE___SIGHANDLER_T 1
#define IBV_HW_TM 1
#define LT_OBJDIR ".libs/"
#define NVALGRIND 1
#define PACKAGE "ucx"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ucx"
#define PACKAGE_STRING "ucx 1.11"
#define PACKAGE_TARNAME "ucx"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "1.11"
#define STDC_HEADERS 1
#define STRERROR_R_CHAR_P 1
#define UCM_BISTRO_HOOKS 1
#define UCS_MAX_LOG_LEVEL UCS_LOG_LEVEL_DEBUG
#define UCT_TCP_EP_KEEPALIVE 1
#define UCT_UD_EP_DEBUG_HOOKS 0
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt"
#define UCX_MODULE_SUBDIR "ucx"
#define VERSION "1.11"
#define restrict __restrict
#define test_MODULES ":module"
#define ucg_MODULES ":builtin"
#define ucm_MODULES ":cuda"
#define ucs_MODULES ""
#define uct_MODULES ":cuda:ib:rdmacm:cma:xpmem"
#define uct_cuda_MODULES ":gdrcopy"
#define uct_ib_MODULES ""
#define uct_rocm_MODULES ""
#define ucx_perftest_MODULES ":cuda"
#
# Memory domain: posix
# Component: posix
# allocate: unlimited
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: posix
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
#
# Transport: sysv
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: self
# Device: memory0
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: lo
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: eth3
# System device: <unknown>
#
# capabilities:
# bandwidth: 11316.36/ppn + 0.00 MB/sec
# latency: 5206 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: cuda_cpy
# Component: cuda_cpy
# allocate: unlimited
# register: unlimited, cost: 0 nsec
#
# Transport: cuda_copy
# Device: cuda
# System device: <unknown>
#
# capabilities:
# bandwidth: 10000.00/ppn + 0.00 MB/sec
# latency: 8000 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_short: <= 4294967295
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: cuda_ipc
# Component: cuda_ipc
# register: unlimited, cost: 0 nsec
# remote key: 112 bytes
#
# Transport: cuda_ipc
# Device: cuda
# System device: <unknown>
#
# capabilities:
# bandwidth: 300000.00/ppn + 0.00 MB/sec
# latency: 1 nsec
# overhead: 0 nsec
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 400 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE_FILTER=*
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=y
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_VFS_ENABLE=y
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt/lib/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_HOOK_MODE=bistro
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_MEM_MODULE_UNLOAD_PREVENT_MODE=lazy
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_POSIX_ALLOC=md,mmap,heap
UCX_POSIX_FAILURE=DIAG
UCX_POSIX_MAX_NUM_EPS=inf
UCX_POSIX_BW=12179.00MBps
UCX_POSIX_FIFO_SIZE=64
UCX_POSIX_SEG_SIZE=8256
UCX_POSIX_FIFO_RELEASE_FACTOR=0.500
UCX_POSIX_RX_MAX_BUFS=-1
UCX_POSIX_RX_BUFS_GROW=512
UCX_POSIX_FIFO_HUGETLB=n
UCX_POSIX_FIFO_ELEM_SIZE=128
UCX_POSIX_FIFO_MAX_POLL=16
UCX_POSIX_ERROR_HANDLING=n
UCX_SYSV_HUGETLB_MODE=try
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SYSV_FAILURE=DIAG
UCX_SYSV_MAX_NUM_EPS=inf
UCX_SYSV_BW=12179.00MBps
UCX_SYSV_FIFO_SIZE=64
UCX_SYSV_SEG_SIZE=8256
UCX_SYSV_FIFO_RELEASE_FACTOR=0.500
UCX_SYSV_RX_MAX_BUFS=-1
UCX_SYSV_RX_BUFS_GROW=512
UCX_SYSV_FIFO_HUGETLB=n
UCX_SYSV_FIFO_ELEM_SIZE=128
UCX_SYSV_FIFO_MAX_POLL=16
UCX_SYSV_ERROR_HANDLING=n
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=DIAG
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_SELF_NUM_DEVICES=1
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=DIAG
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_PORT_RANGE=0
UCX_TCP_KEEPIDLE=10000000.00us
UCX_TCP_KEEPCNT=3
UCX_TCP_KEEPINTVL=1000000.00us
UCX_TCP_CM_FAILURE=DIAG
UCX_TCP_CM_REUSEADDR=n
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,tcp,sockcm
UCX_SOCKADDR_AUX_TLS=ud
UCX_SELECT_DISTANCE_MD=cuda_cpy
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=4.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_ADDRESS_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=512K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_CM_USE_ALL_DEVICES=y
UCX_LISTENER_BACKLOG=auto
UCX_PROTO_ENABLE=n
UCX_KEEPALIVE_INTERVAL=60000000.00us
UCX_KEEPALIVE_NUM_EPS=128
UCX_PROTO_INDIRECT_ID=auto
UCX_ERROR_HANDLER_DELAY=0.00us
UCX_CUDA_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_COPY_FAILURE=DIAG
UCX_CUDA_COPY_MAX_NUM_EPS=inf
UCX_CUDA_COPY_MAX_POLL=16
UCX_CUDA_COPY_MAX_EVENTS=inf
UCX_CUDA_IPC_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_IPC_FAILURE=DIAG
UCX_CUDA_IPC_MAX_NUM_EPS=inf
UCX_CUDA_IPC_MAX_POLL=16
UCX_CUDA_IPC_MAX_STREAMS=16
UCX_CUDA_IPC_CACHE=y
UCX_CUDA_IPC_MAX_EVENTS=inf
UCX_GDR_COPY_RCACHE=try
UCX_GDR_COPY_RCACHE_MEM_PRIO=1000
UCX_GDR_COPY_RCACHE_OVERHEAD=0.18us
UCX_GDR_COPY_RCACHE_ADDR_ALIGN=65536
UCX_GDR_COPY_RCACHE_MAX_REGIONS=inf
UCX_GDR_COPY_RCACHE_MAX_SIZE=inf
UCX_GDR_COPY_MEM_REG_OVERHEAD=16.00us
UCX_GDR_COPY_MEM_REG_GROWTH=0.00us
UCX_GDR_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_GDR_COPY_FAILURE=DIAG
UCX_GDR_COPY_MAX_NUM_EPS=inf
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_RCACHE_MAX_REGIONS=inf
UCX_IB_RCACHE_MAX_SIZE=inf
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=y
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=n
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq,dci
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=DIAG
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=4
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=auto
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_LOCAL_SUBNET=n
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_MAX_RD_ATOMIC=4
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_TX_POLL_ALWAYS=n
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_VERBS_FLUSH_MODE=auto
UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_RC_MLX5_FAILURE=DIAG
UCX_RC_MLX5_MAX_NUM_EPS=256
UCX_RC_MLX5_SEG_SIZE=8256
UCX_RC_MLX5_TX_QUEUE_LEN=256
UCX_RC_MLX5_TX_MAX_BATCH=16
UCX_RC_MLX5_TX_MAX_POLL=16
UCX_RC_MLX5_TX_MIN_INLINE=64
UCX_RC_MLX5_TX_INLINE_RESP=64
UCX_RC_MLX5_TX_MIN_SGE=4
UCX_RC_MLX5_TX_MAX_BUFS=-1
UCX_RC_MLX5_TX_BUFS_GROW=1024
UCX_RC_MLX5_RX_QUEUE_LEN=4095
UCX_RC_MLX5_RX_MAX_BATCH=16
UCX_RC_MLX5_RX_MAX_POLL=16
UCX_RC_MLX5_RX_INLINE=64
UCX_RC_MLX5_RX_MAX_BUFS=-1
UCX_RC_MLX5_RX_BUFS_GROW=0
UCX_RC_MLX5_ADDR_TYPE=auto
UCX_RC_MLX5_IS_GLOBAL=n
UCX_RC_MLX5_SL=auto
UCX_RC_MLX5_TRAFFIC_CLASS=auto
UCX_RC_MLX5_HOP_LIMIT=255
UCX_RC_MLX5_NUM_PATHS=auto
UCX_RC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_RC_MLX5_ROCE_PATH_FACTOR=1
UCX_RC_MLX5_LID_PATH_BITS=0
UCX_RC_MLX5_PKEY=auto
UCX_RC_MLX5_PATH_MTU=default
UCX_RC_MLX5_MAX_RD_ATOMIC=4
UCX_RC_MLX5_TIMEOUT=1000000.00us
UCX_RC_MLX5_RETRY_COUNT=7
UCX_RC_MLX5_RNR_TIMEOUT=1000.00us
UCX_RC_MLX5_RNR_RETRY_COUNT=7
UCX_RC_MLX5_FC_ENABLE=y
UCX_RC_MLX5_FC_WND_SIZE=512
UCX_RC_MLX5_FC_HARD_THRESH=0.250
UCX_RC_MLX5_FENCE=auto
UCX_RC_MLX5_MAX_GET_ZCOPY=auto
UCX_RC_MLX5_TX_NUM_GET_BYTES=inf
UCX_RC_MLX5_TX_POLL_ALWAYS=n
UCX_RC_MLX5_FC_SOFT_THRESH=0.500
UCX_RC_MLX5_TX_CQ_MODERATION=64
UCX_RC_MLX5_TX_CQ_LEN=4096
UCX_RC_MLX5_DM_SIZE=2K
UCX_RC_MLX5_DM_COUNT=1
UCX_RC_MLX5_MMIO_MODE=auto
UCX_RC_MLX5_AR_ENABLE=auto
UCX_RC_MLX5_TX_MAX_BB=inf
UCX_RC_MLX5_TM_ENABLE=n
UCX_RC_MLX5_TM_LIST_SIZE=1024
UCX_RC_MLX5_TM_SEG_SIZE=48K
UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_RC_MLX5_TM_MP_NUM_STRIDES=8
UCX_RC_MLX5_EXP_BACKOFF=0
UCX_RC_MLX5_SRQ_TOPO=cyclic,cyclic_emulated
UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_DC_MLX5_FAILURE=DIAG
UCX_DC_MLX5_MAX_NUM_EPS=inf
UCX_DC_MLX5_SEG_SIZE=8256
UCX_DC_MLX5_TX_QUEUE_LEN=128
UCX_DC_MLX5_TX_MAX_BATCH=16
UCX_DC_MLX5_TX_MAX_POLL=16
UCX_DC_MLX5_TX_MIN_INLINE=64
UCX_DC_MLX5_TX_INLINE_RESP=64
UCX_DC_MLX5_TX_MIN_SGE=4
UCX_DC_MLX5_TX_MAX_BUFS=-1
UCX_DC_MLX5_TX_BUFS_GROW=1024
UCX_DC_MLX5_RX_QUEUE_LEN=4095
UCX_DC_MLX5_RX_MAX_BATCH=16
UCX_DC_MLX5_RX_MAX_POLL=16
UCX_DC_MLX5_RX_INLINE=64
UCX_DC_MLX5_RX_MAX_BUFS=-1
UCX_DC_MLX5_RX_BUFS_GROW=0
UCX_DC_MLX5_ADDR_TYPE=auto
UCX_DC_MLX5_IS_GLOBAL=n
UCX_DC_MLX5_SL=auto
UCX_DC_MLX5_TRAFFIC_CLASS=auto
UCX_DC_MLX5_HOP_LIMIT=255
UCX_DC_MLX5_NUM_PATHS=auto
UCX_DC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_DC_MLX5_ROCE_PATH_FACTOR=1
UCX_DC_MLX5_LID_PATH_BITS=0
UCX_DC_MLX5_PKEY=auto
UCX_DC_MLX5_PATH_MTU=default
UCX_DC_MLX5_MAX_RD_ATOMIC=4
UCX_DC_MLX5_TIMEOUT=1000000.00us
UCX_DC_MLX5_RETRY_COUNT=7
UCX_DC_MLX5_RNR_TIMEOUT=1000.00us
UCX_DC_MLX5_RNR_RETRY_COUNT=7
UCX_DC_MLX5_FC_ENABLE=y
UCX_DC_MLX5_FC_WND_SIZE=512
UCX_DC_MLX5_FC_HARD_THRESH=0.250
UCX_DC_MLX5_FENCE=auto
UCX_DC_MLX5_MAX_GET_ZCOPY=auto
UCX_DC_MLX5_TX_NUM_GET_BYTES=inf
UCX_DC_MLX5_TX_POLL_ALWAYS=n
UCX_DC_MLX5_DM_SIZE=2K
UCX_DC_MLX5_DM_COUNT=1
UCX_DC_MLX5_MMIO_MODE=auto
UCX_DC_MLX5_AR_ENABLE=auto
UCX_DC_MLX5_TX_MAX_BB=inf
UCX_DC_MLX5_TM_ENABLE=n
UCX_DC_MLX5_TM_LIST_SIZE=1024
UCX_DC_MLX5_TM_SEG_SIZE=48K
UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_DC_MLX5_TM_MP_NUM_STRIDES=8
UCX_DC_MLX5_EXP_BACKOFF=0
UCX_DC_MLX5_SRQ_TOPO=list
UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128
UCX_DC_MLX5_NUM_DCI=8
UCX_DC_MLX5_TX_POLICY=dcs_quota
UCX_DC_MLX5_DCI_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCT_FULL_HANDSHAKE=n
UCX_DC_MLX5_RAND_DCI_SEED=0
UCX_DC_MLX5_QUOTA=32
UCX_DC_MLX5_FC_HARD_REQ_TIMEOUT=5000000.00us
UCX_DC_MLX5_COMPACT_AV=y
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=DIAG
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=4
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=auto
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_LOCAL_SUBNET=n
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_MIN_POKE_TIME=250000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_UD_MLX5_FAILURE=DIAG
UCX_UD_MLX5_MAX_NUM_EPS=inf
UCX_UD_MLX5_SEG_SIZE=8K
UCX_UD_MLX5_TX_QUEUE_LEN=256
UCX_UD_MLX5_TX_MAX_BATCH=16
UCX_UD_MLX5_TX_MAX_POLL=16
UCX_UD_MLX5_TX_MIN_INLINE=64
UCX_UD_MLX5_TX_INLINE_RESP=0
UCX_UD_MLX5_TX_MIN_SGE=4
UCX_UD_MLX5_TX_MAX_BUFS=-1
UCX_UD_MLX5_TX_BUFS_GROW=1024
UCX_UD_MLX5_RX_QUEUE_LEN=4096
UCX_UD_MLX5_RX_MAX_BATCH=16
UCX_UD_MLX5_RX_MAX_POLL=16
UCX_UD_MLX5_RX_INLINE=0
UCX_UD_MLX5_RX_MAX_BUFS=-1
UCX_UD_MLX5_RX_BUFS_GROW=0
UCX_UD_MLX5_ADDR_TYPE=auto
UCX_UD_MLX5_IS_GLOBAL=n
UCX_UD_MLX5_SL=auto
UCX_UD_MLX5_TRAFFIC_CLASS=auto
UCX_UD_MLX5_HOP_LIMIT=255
UCX_UD_MLX5_NUM_PATHS=auto
UCX_UD_MLX5_ROCE_LOCAL_SUBNET=n
UCX_UD_MLX5_ROCE_PATH_FACTOR=1
UCX_UD_MLX5_LID_PATH_BITS=0
UCX_UD_MLX5_PKEY=auto
UCX_UD_MLX5_PATH_MTU=default
UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128
UCX_UD_MLX5_TIMEOUT=300000000.00us
UCX_UD_MLX5_TIMER_TICK=10000.00us
UCX_UD_MLX5_TIMER_BACKOFF=2.000
UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us
UCX_UD_MLX5_MIN_POKE_TIME=250000.00us
UCX_UD_MLX5_ETH_DGID_CHECK=y
UCX_UD_MLX5_MAX_WINDOW=1025
UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_DM_SIZE=2K
UCX_UD_MLX5_DM_COUNT=1
UCX_UD_MLX5_MMIO_MODE=auto
UCX_UD_MLX5_AR_ENABLE=auto
UCX_UD_MLX5_COMPACT_AV=y
UCX_RDMA_CM_FAILURE=DIAG
UCX_RDMA_CM_REUSEADDR=n
UCX_RDMA_CM_SOURCE_ADDRESS=
UCX_RDMA_CM_TIMEOUT=10000000.00us
UCX_RDMA_CM_RESERVED_QPN=try
UCX_CMA_ALLOC=huge,thp,mmap,heap
UCX_CMA_FAILURE=DIAG
UCX_CMA_MAX_NUM_EPS=inf
UCX_CMA_BW=11145.00MBps
UCX_CMA_MAX_IOV=16
UCX_CMA_SEG_SIZE=512K
UCX_CMA_TX_QUOTA=1
UCX_CMA_TX_MAX_BUFS=-1
UCX_CMA_TX_BUFS_GROW=8
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
# UCT version=1.11.0 revision 6031c98
# configured with: --disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt
#define UCX_CONFIG_H
#define ENABLE_BUILTIN_MEMCPY 1
#define ENABLE_DEBUG_DATA 0
#define ENABLE_MT 1
#define ENABLE_PARAMS_CHECK 0
#define HAVE_1_ARG_BFD_SECTION_SIZE 0
#define HAVE_ALLOCA 1
#define HAVE_ALLOCA_H 1
#define HAVE_ATTRIBUTE_NOOPTIMIZE 1
#define HAVE_CLEARENV 1
#define HAVE_CPU_SET_T 1
#define HAVE_CUDA 1
#define HAVE_CUDA_H 1
#define HAVE_CUDA_RUNTIME_H 1
#define HAVE_DC_DV 1
#define HAVE_DECL_ASPRINTF 1
#define HAVE_DECL_BASENAME 1
#define HAVE_DECL_BFD_GET_SECTION_FLAGS 0
#define HAVE_DECL_BFD_GET_SECTION_VMA 0
#define HAVE_DECL_BFD_SECTION_FLAGS 0
#define HAVE_DECL_BFD_SECTION_VMA 0
#define HAVE_DECL_CPU_ISSET 1
#define HAVE_DECL_CPU_ZERO 1
#define HAVE_DECL_ETHTOOL_CMD_SPEED 1
#define HAVE_DECL_FMEMOPEN 1
#define HAVE_DECL_FUSE_MOUNT 0
#define HAVE_DECL_FUSE_OPEN_CHANNEL 0
#define HAVE_DECL_FUSE_UNMOUNT 0
#define HAVE_DECL_F_SETOWN_EX 1
#define HAVE_DECL_GDR_COPY_TO_MAPPING 1
#define HAVE_DECL_IBV_ACCESS_ON_DEMAND 1
#define HAVE_DECL_IBV_ACCESS_RELAXED_ORDERING 1
#define HAVE_DECL_IBV_ADVISE_MR 1
#define HAVE_DECL_IBV_ALLOC_DM 1
#define HAVE_DECL_IBV_ALLOC_TD 1
#define HAVE_DECL_IBV_CMD_MODIFY_QP 0
#define HAVE_DECL_IBV_CREATE_CQ_ATTR_IGNORE_OVERRUN 1
#define HAVE_DECL_IBV_CREATE_QP_EX 1
#define HAVE_DECL_IBV_CREATE_SRQ 1
#define HAVE_DECL_IBV_CREATE_SRQ_EX 1
#define HAVE_DECL_IBV_EVENT_GID_CHANGE 1
#define HAVE_DECL_IBV_EVENT_TYPE_STR 1
#define HAVE_DECL_IBV_EXP_ACCESS_ALLOCATE_MR 0
#define HAVE_DECL_IBV_EXP_ACCESS_ON_DEMAND 0
#define HAVE_DECL_IBV_EXP_ALLOC_DM 0
#define HAVE_DECL_IBV_EXP_ATOMIC_HCA_REPLY_BE 0
#define HAVE_DECL_IBV_EXP_CQ_IGNORE_OVERRUN 0
#define HAVE_DECL_IBV_EXP_CQ_MODERATION 0
#define HAVE_DECL_IBV_EXP_CREATE_QP 0
#define HAVE_DECL_IBV_EXP_CREATE_SRQ 0
#define HAVE_DECL_IBV_EXP_DCT_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_PCI_ATOMIC_CAPS 0
#define HAVE_DECL_IBV_EXP_DEVICE_ATTR_RESERVED_2 0
#define HAVE_DECL_IBV_EXP_DEVICE_DC_TRANSPORT 0
#define HAVE_DECL_IBV_EXP_DEVICE_MR_ALLOCATE 0
#define HAVE_DECL_IBV_EXP_MR_FIXED_BUFFER_SIZE 0
#define HAVE_DECL_IBV_EXP_MR_INDIRECT_KLMS 0
#define HAVE_DECL_IBV_EXP_ODP_SUPPORT_IMPLICIT 0
#define HAVE_DECL_IBV_EXP_POST_SEND 0
#define HAVE_DECL_IBV_EXP_PREFETCH_MR 0
#define HAVE_DECL_IBV_EXP_PREFETCH_WRITE_ACCESS 0
#define HAVE_DECL_IBV_EXP_QPT_DC_INI 0
#define HAVE_DECL_IBV_EXP_QP_CREATE_UMR 0
#define HAVE_DECL_IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG 0
#define HAVE_DECL_IBV_EXP_QP_OOO_RW_DATA_PLACEMENT 0
#define HAVE_DECL_IBV_EXP_QUERY_DEVICE 0
#define HAVE_DECL_IBV_EXP_QUERY_GID_ATTR 0
#define HAVE_DECL_IBV_EXP_REG_MR 0
#define HAVE_DECL_IBV_EXP_SEND_EXT_ATOMIC_INLINE 0
#define HAVE_DECL_IBV_EXP_SETENV 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP 0
#define HAVE_DECL_IBV_EXP_WR_EXT_MASKED_ATOMIC_FETCH_AND_ADD 0
#define HAVE_DECL_IBV_EXP_WR_NOP 0
#define HAVE_DECL_IBV_GET_ASYNC_EVENT 1
#define HAVE_DECL_IBV_GET_DEVICE_NAME 1
#define HAVE_DECL_IBV_LINK_LAYER_ETHERNET 1
#define HAVE_DECL_IBV_LINK_LAYER_INFINIBAND 1
#define HAVE_DECL_IBV_ODP_SUPPORT_IMPLICIT 1
#define HAVE_DECL_IBV_QPF_GRH_REQUIRED 1
#define HAVE_DECL_IBV_QUERY_DEVICE_EX 1
#define HAVE_DECL_IBV_QUERY_GID 1
#define HAVE_DECL_IBV_WC_STATUS_STR 1
#define HAVE_DECL_INOTIFY_ADD_WATCH 1
#define HAVE_DECL_INOTIFY_INIT 1
#define HAVE_DECL_IN_ATTRIB 1
#define HAVE_DECL_IPPROTO_TCP 1
#define HAVE_DECL_MADV_FREE 1
#define HAVE_DECL_MADV_REMOVE 1
#define HAVE_DECL_MLX5DV_CQ_INIT_ATTR_MASK_CQE_SIZE 1
#define HAVE_DECL_MLX5DV_CREATE_QP 1
#define HAVE_DECL_MLX5DV_DCTYPE_DCT 1
#define HAVE_DECL_MLX5DV_DEVX_SUBSCRIBE_DEVX_EVENT 1
#define HAVE_DECL_MLX5DV_INIT_OBJ 1
#define HAVE_DECL_MLX5DV_IS_SUPPORTED 1
#define HAVE_DECL_MLX5DV_OBJ_AH 1
#define HAVE_DECL_MLX5DV_QP_CREATE_ALLOW_SCATTER_TO_CQE 1
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_BF 0
#define HAVE_DECL_MLX5DV_UAR_ALLOC_TYPE_NC 0
#define HAVE_DECL_POSIX_MADV_DONTNEED 1
#define HAVE_DECL_PR_SET_PTRACER 1
#define HAVE_DECL_SOL_SOCKET 1
#define HAVE_DECL_SO_KEEPALIVE 1
#define HAVE_DECL_SPEED_UNKNOWN 1
#define HAVE_DECL_STRERROR_R 1
#define HAVE_DECL_SYS_BRK 1
#define HAVE_DECL_SYS_IPC 0
#define HAVE_DECL_SYS_MADVISE 1
#define HAVE_DECL_SYS_MMAP 1
#define HAVE_DECL_SYS_MREMAP 1
#define HAVE_DECL_SYS_MUNMAP 1
#define HAVE_DECL_SYS_SHMAT 1
#define HAVE_DECL_SYS_SHMDT 1
#define HAVE_DECL_TCP_KEEPCNT 1
#define HAVE_DECL_TCP_KEEPIDLE 1
#define HAVE_DECL_TCP_KEEPINTVL 1
#define HAVE_DECL___PPC_GET_TIMEBASE_FREQ 0
#define HAVE_DEVX 1
#define HAVE_DLFCN_H 1
#define HAVE_GDRAPI_H 1
#define HAVE_HW_TIMER 1
#define HAVE_IB 1
#define HAVE_IBV_DM 1
#define HAVE_IN6_ADDR_S6_ADDR32 1
#define HAVE_INFINIBAND_MLX5DV_H 1
#define HAVE_INFINIBAND_TM_TYPES_H 1
#define HAVE_INOTIFY 1
#define HAVE_INTTYPES_H 1
#define HAVE_IP_IP_DST 1
#define HAVE_LIBGEN_H 1
#define HAVE_LIBRT 1
#define HAVE_LINUX_FUTEX_H 1
#define HAVE_LINUX_IP_H 1
#define HAVE_LINUX_MMAN_H 1
#define HAVE_MALLOC_H 1
#define HAVE_MALLOC_HOOK 1
#define HAVE_MALLOC_TRIM 1
#define HAVE_MEMALIGN 1
#define HAVE_MEMORY_H 1
#define HAVE_MLX5_HW 1
#define HAVE_MLX5_HW_UD 1
#define HAVE_MREMAP 1
#define HAVE_NETINET_IP_H 1
#define HAVE_NET_ETHERNET_H 1
#define HAVE_NUMA 1
#define HAVE_NUMAIF_H 1
#define HAVE_NUMA_H 1
#define HAVE_ODP 1
#define HAVE_ODP_IMPLICIT 1
#define HAVE_POSIX_MEMALIGN 1
#define HAVE_PREFETCH 1
#define HAVE_SCHED_GETAFFINITY 1
#define HAVE_SCHED_SETAFFINITY 1
#define HAVE_SIGACTION_SA_RESTORER 1
#define HAVE_SIGEVENT_SIGEV_UN_TID 1
#define HAVE_SIGHANDLER_T 1
#define HAVE_STDINT_H 1
#define HAVE_STDLIB_H 1
#define HAVE_STRERROR_R 1
#define HAVE_STRINGS_H 1
#define HAVE_STRING_H 1
#define HAVE_STRUCT_BITMASK 1
#define HAVE_STRUCT_DL_PHDR_INFO 1
#define HAVE_STRUCT_IBV_DEVICE_ATTR_EX_PCI_ATOMIC_CAPS 1
#define HAVE_STRUCT_IBV_TM_CAPS_FLAGS 1
#define HAVE_STRUCT_MLX5DV_CQ_CQ_UAR 1
#define HAVE_SYS_EPOLL_H 1
#define HAVE_SYS_EVENTFD_H 1
#define HAVE_SYS_STAT_H 1
#define HAVE_SYS_TYPES_H 1
#define HAVE_SYS_UIO_H 1
#define HAVE_TL_DC 1
#define HAVE_TL_RC 1
#define HAVE_TL_UD 1
#define HAVE_UCM_PTMALLOC286 1
#define HAVE_UNISTD_H 1
#define HAVE___CLEAR_CACHE 1
#define HAVE___CURBRK 1
#define HAVE___SIGHANDLER_T 1
#define IBV_HW_TM 1
#define LT_OBJDIR ".libs/"
#define NVALGRIND 1
#define PACKAGE "ucx"
#define PACKAGE_BUGREPORT ""
#define PACKAGE_NAME "ucx"
#define PACKAGE_STRING "ucx 1.11"
#define PACKAGE_TARNAME "ucx"
#define PACKAGE_URL ""
#define PACKAGE_VERSION "1.11"
#define STDC_HEADERS 1
#define STRERROR_R_CHAR_P 1
#define UCM_BISTRO_HOOKS 1
#define UCS_MAX_LOG_LEVEL UCS_LOG_LEVEL_DEBUG
#define UCT_TCP_EP_KEEPALIVE 1
#define UCT_UD_EP_DEBUG_HOOKS 0
#define UCX_CONFIGURE_FLAGS "--disable-logging --disable-debug --disable-assertions --disable-params-check --enable-mt --without-knem --with-xpmem=/hpc/local/oss/xpmem/90a95a4 --without-java --enable-devel-headers --with-cuda=/hpc/local/oss/cuda11.2 --with-gdrcopy --prefix=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt"
#define UCX_MODULE_SUBDIR "ucx"
#define VERSION "1.11"
#define restrict __restrict
#define test_MODULES ":module"
#define ucg_MODULES ":builtin"
#define ucm_MODULES ":cuda"
#define ucs_MODULES ""
#define uct_MODULES ":cuda:ib:rdmacm:cma:xpmem"
#define uct_cuda_MODULES ":gdrcopy"
#define uct_ib_MODULES ""
#define uct_rocm_MODULES ""
#define ucx_perftest_MODULES ":cuda"
#
# Memory domain: posix
# Component: posix
# allocate: unlimited
# remote key: 24 bytes
# rkey_ptr is supported
#
# Transport: posix
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: sysv
# Component: sysv
# allocate: unlimited
# remote key: 12 bytes
# rkey_ptr is supported
#
# Transport: sysv
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 12179.00 MB/sec
# latency: 80 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 100
# am_bcopy: <= 8256
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: self
# Component: self
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: self
# Device: memory0
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 6911.00 MB/sec
# latency: 0 nsec
# overhead: 10 nsec
# put_short: <= 4294967295
# put_bcopy: unlimited
# get_bcopy: unlimited
# am_short: <= 8K
# am_bcopy: <= 8K
# domain: cpu
# atomic_add: 32, 64 bit
# atomic_and: 32, 64 bit
# atomic_or: 32, 64 bit
# atomic_xor: 32, 64 bit
# atomic_fadd: 32, 64 bit
# atomic_fand: 32, 64 bit
# atomic_for: 32, 64 bit
# atomic_fxor: 32, 64 bit
# atomic_swap: 32, 64 bit
# atomic_cswap: 32, 64 bit
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: ep_check
#
#
# Memory domain: tcp
# Component: tcp
# register: unlimited, cost: 0 nsec
# remote key: 0 bytes
#
# Transport: tcp
# Device: eth4
# System device: <unknown>
#
# capabilities:
# bandwidth: 11316.36/ppn + 0.00 MB/sec
# latency: 5206 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 0
# device num paths: 1
# max eps: 256
# device address: 6 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
# Transport: tcp
# Device: lo
# System device: <unknown>
#
# capabilities:
# bandwidth: 11.91/ppn + 0.00 MB/sec
# latency: 10960 nsec
# overhead: 50000 nsec
# put_zcopy: <= 18446744073709551590, up to 6 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 0
# am_short: <= 8K
# am_bcopy: <= 8K
# am_zcopy: <= 64K, up to 6 iov
# am_opt_zcopy_align: <= 1
# am_align_mtu: <= 0
# am header: <= 8037
# connection: to ep, to iface
# device priority: 1
# device num paths: 1
# max eps: 256
# device address: 18 bytes
# iface address: 2 bytes
# ep address: 10 bytes
# error handling: peer failure, ep_check, keepalive
#
#
# Connection manager: tcp
# max_conn_priv: 2064 bytes
#
# Memory domain: cuda_cpy
# Component: cuda_cpy
# allocate: unlimited
# register: unlimited, cost: 0 nsec
#
# Transport: cuda_copy
# Device: cuda
# System device: <unknown>
#
# capabilities:
# bandwidth: 10000.00/ppn + 0.00 MB/sec
# latency: 8000 nsec
# overhead: 0 nsec
# put_short: <= 4294967295
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_short: <= 4294967295
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 0 bytes
# iface address: 8 bytes
# error handling: none
#
#
# Memory domain: cuda_ipc
# Component: cuda_ipc
# register: unlimited, cost: 0 nsec
# remote key: 112 bytes
#
# Transport: cuda_ipc
# Device: cuda
# System device: <unknown>
#
# capabilities:
# bandwidth: 300000.00/ppn + 0.00 MB/sec
# latency: 1 nsec
# overhead: 0 nsec
# put_zcopy: unlimited, up to 1 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 1 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
# < failed to open connection manager rdmacm >
#
# Memory domain: cma
# Component: cma
# register: unlimited, cost: 9 nsec
#
# Transport: cma
# Device: memory
# System device: <unknown>
#
# capabilities:
# bandwidth: 0.00/ppn + 11145.00 MB/sec
# latency: 80 nsec
# overhead: 400 nsec
# put_zcopy: unlimited, up to 16 iov
# put_opt_zcopy_align: <= 1
# put_align_mtu: <= 1
# get_zcopy: unlimited, up to 16 iov
# get_opt_zcopy_align: <= 1
# get_align_mtu: <= 1
# connection: to iface
# device priority: 0
# device num paths: 1
# max eps: inf
# device address: 8 bytes
# iface address: 4 bytes
# error handling: peer failure, ep_check
#
UCX_LOG_LEVEL=WARN
UCX_LOG_FILE_FILTER=*
UCX_LOG_FILE=
UCX_LOG_FILE_SIZE=inf
UCX_LOG_FILE_ROTATE=0
UCX_LOG_BUFFER=1K
UCX_LOG_DATA_SIZE=0
UCX_LOG_PRINT_ENABLE=n
UCX_HANDLE_ERRORS=bt
UCX_ERROR_SIGNALS=ILL,SEGV,BUS,FPE
UCX_ERROR_MAIL_TO=
UCX_ERROR_MAIL_FOOTER=
UCX_GDB_COMMAND=gdb -quiet
UCX_DEBUG_SIGNO=HUP
UCX_LOG_LEVEL_TRIGGER=FATAL
UCX_WARN_UNUSED_ENV_VARS=y
UCX_ASYNC_MAX_EVENTS=1024
UCX_ASYNC_SIGNO=ALRM
UCX_VFS_ENABLE=y
UCX_PROFILE_MODE=
UCX_PROFILE_FILE=ucx_%h_%p.prof
UCX_PROFILE_LOG_SIZE=4M
UCX_RCACHE_CHECK_PFN=0
UCX_MODULE_DIR=/build-result/hpcx-v2.9.0-gcc-inbox-ubuntu20.04-x86_64/ucx/mt/lib/ucx
UCX_MODULE_LOG_LEVEL=TRACE
UCX_BUILTIN_MEMCPY_MIN=auto
UCX_BUILTIN_MEMCPY_MAX=auto
UCX_MEM_LOG_LEVEL=WARN
UCX_MEM_ALLOC_ALIGN=16
UCX_MEM_EVENTS=y
UCX_MEM_MMAP_HOOK_MODE=bistro
UCX_MEM_MALLOC_HOOKS=y
UCX_MEM_MALLOC_RELOC=y
UCX_MEM_CUDA_HOOK_MODE=bistro
UCX_MEM_DYNAMIC_MMAP_THRESH=y
UCX_MEM_DLOPEN_PROCESS_RPATH=y
UCX_MEM_MODULE_UNLOAD_PREVENT_MODE=lazy
UCX_POSIX_HUGETLB_MODE=try
UCX_POSIX_DIR=/dev/shm
UCX_POSIX_USE_PROC_LINK=y
UCX_POSIX_ALLOC=md,mmap,heap
UCX_POSIX_FAILURE=DIAG
UCX_POSIX_MAX_NUM_EPS=inf
UCX_POSIX_BW=12179.00MBps
UCX_POSIX_FIFO_SIZE=64
UCX_POSIX_SEG_SIZE=8256
UCX_POSIX_FIFO_RELEASE_FACTOR=0.500
UCX_POSIX_RX_MAX_BUFS=-1
UCX_POSIX_RX_BUFS_GROW=512
UCX_POSIX_FIFO_HUGETLB=n
UCX_POSIX_FIFO_ELEM_SIZE=128
UCX_POSIX_FIFO_MAX_POLL=16
UCX_POSIX_ERROR_HANDLING=n
UCX_SYSV_HUGETLB_MODE=try
UCX_SYSV_ALLOC=md,mmap,heap
UCX_SYSV_FAILURE=DIAG
UCX_SYSV_MAX_NUM_EPS=inf
UCX_SYSV_BW=12179.00MBps
UCX_SYSV_FIFO_SIZE=64
UCX_SYSV_SEG_SIZE=8256
UCX_SYSV_FIFO_RELEASE_FACTOR=0.500
UCX_SYSV_RX_MAX_BUFS=-1
UCX_SYSV_RX_BUFS_GROW=512
UCX_SYSV_FIFO_HUGETLB=n
UCX_SYSV_FIFO_ELEM_SIZE=128
UCX_SYSV_FIFO_MAX_POLL=16
UCX_SYSV_ERROR_HANDLING=n
UCX_SELF_ALLOC=huge,thp,md,mmap,heap
UCX_SELF_FAILURE=DIAG
UCX_SELF_MAX_NUM_EPS=inf
UCX_SELF_SEG_SIZE=8K
UCX_SELF_NUM_DEVICES=1
UCX_TCP_ALLOC=huge,thp,md,mmap,heap
UCX_TCP_FAILURE=DIAG
UCX_TCP_MAX_NUM_EPS=256
UCX_TCP_TX_SEG_SIZE=8K
UCX_TCP_RX_SEG_SIZE=64K
UCX_TCP_MAX_IOV=6
UCX_TCP_SENDV_THRESH=2K
UCX_TCP_PREFER_DEFAULT=y
UCX_TCP_PUT_ENABLE=y
UCX_TCP_CONN_NB=n
UCX_TCP_MAX_POLL=16
UCX_TCP_MAX_CONN_RETRIES=25
UCX_TCP_NODELAY=y
UCX_TCP_SNDBUF=auto
UCX_TCP_RCVBUF=auto
UCX_TCP_SYN_CNT=auto
UCX_TCP_TX_MAX_BUFS=-1
UCX_TCP_TX_BUFS_GROW=8
UCX_TCP_RX_MAX_BUFS=-1
UCX_TCP_RX_BUFS_GROW=8
UCX_TCP_PORT_RANGE=0
UCX_TCP_KEEPIDLE=10000000.00us
UCX_TCP_KEEPCNT=3
UCX_TCP_KEEPINTVL=1000000.00us
UCX_TCP_CM_FAILURE=DIAG
UCX_TCP_CM_REUSEADDR=n
UCX_TCP_CM_PRIV_DATA_LEN=2K
UCX_TCP_CM_SNDBUF=auto
UCX_TCP_CM_RCVBUF=auto
UCX_TCP_CM_SYN_CNT=auto
UCX_NET_DEVICES=all
UCX_SHM_DEVICES=all
UCX_ACC_DEVICES=all
UCX_SELF_DEVICES=all
UCX_TLS=all
UCX_ALLOC_PRIO=md:sysv,md:posix,huge,thp,md:*,mmap,heap
UCX_SOCKADDR_TLS_PRIORITY=rdmacm,tcp,sockcm
UCX_SOCKADDR_AUX_TLS=ud
UCX_SELECT_DISTANCE_MD=cuda_cpy
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=
UCX_WARN_INVALID_CONFIG=y
UCX_BCOPY_THRESH=0
UCX_RNDV_THRESH=auto
UCX_RNDV_SEND_NBR_THRESH=256K
UCX_RNDV_THRESH_FALLBACK=inf
UCX_RNDV_PERF_DIFF=1.000
UCX_MULTI_LANE_MAX_RATIO=4.000
UCX_MAX_EAGER_RAILS=1
UCX_MAX_RNDV_RAILS=2
UCX_RNDV_SCHEME=auto
UCX_RKEY_PTR_SEG_SIZE=512K
UCX_ZCOPY_THRESH=auto
UCX_BCOPY_BW=auto
UCX_ATOMIC_MODE=guess
UCX_ADDRESS_DEBUG_INFO=n
UCX_MAX_WORKER_ADDRESS_NAME=32
UCX_USE_MT_MUTEX=n
UCX_ADAPTIVE_PROGRESS=y
UCX_SEG_SIZE=8K
UCX_TM_THRESH=1K
UCX_TM_MAX_BB_SIZE=1K
UCX_TM_FORCE_THRESH=8K
UCX_TM_SW_RNDV=n
UCX_NUM_EPS=auto
UCX_NUM_PPN=auto
UCX_RNDV_FRAG_SIZE=512K
UCX_RNDV_PIPELINE_SEND_THRESH=inf
UCX_MEMTYPE_CACHE=y
UCX_FLUSH_WORKER_EPS=y
UCX_UNIFIED_MODE=n
UCX_CM_USE_ALL_DEVICES=y
UCX_LISTENER_BACKLOG=auto
UCX_PROTO_ENABLE=n
UCX_KEEPALIVE_INTERVAL=60000000.00us
UCX_KEEPALIVE_NUM_EPS=128
UCX_PROTO_INDIRECT_ID=auto
UCX_ERROR_HANDLER_DELAY=0.00us
UCX_CUDA_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_COPY_FAILURE=DIAG
UCX_CUDA_COPY_MAX_NUM_EPS=inf
UCX_CUDA_COPY_MAX_POLL=16
UCX_CUDA_COPY_MAX_EVENTS=inf
UCX_CUDA_IPC_ALLOC=huge,thp,md,mmap,heap
UCX_CUDA_IPC_FAILURE=DIAG
UCX_CUDA_IPC_MAX_NUM_EPS=inf
UCX_CUDA_IPC_MAX_POLL=16
UCX_CUDA_IPC_MAX_STREAMS=16
UCX_CUDA_IPC_CACHE=y
UCX_CUDA_IPC_MAX_EVENTS=inf
UCX_GDR_COPY_RCACHE=try
UCX_GDR_COPY_RCACHE_MEM_PRIO=1000
UCX_GDR_COPY_RCACHE_OVERHEAD=0.18us
UCX_GDR_COPY_RCACHE_ADDR_ALIGN=65536
UCX_GDR_COPY_RCACHE_MAX_REGIONS=inf
UCX_GDR_COPY_RCACHE_MAX_SIZE=inf
UCX_GDR_COPY_MEM_REG_OVERHEAD=16.00us
UCX_GDR_COPY_MEM_REG_GROWTH=0.00us
UCX_GDR_COPY_ALLOC=huge,thp,md,mmap,heap
UCX_GDR_COPY_FAILURE=DIAG
UCX_GDR_COPY_MAX_NUM_EPS=inf
UCX_IB_REG_METHODS=rcache,odp,direct
UCX_IB_RCACHE_MEM_PRIO=1000
UCX_IB_RCACHE_OVERHEAD=0.18us
UCX_IB_RCACHE_ADDR_ALIGN=16
UCX_IB_RCACHE_MAX_REGIONS=inf
UCX_IB_RCACHE_MAX_SIZE=inf
UCX_IB_MEM_REG_OVERHEAD=16.00us
UCX_IB_MEM_REG_GROWTH=0.00us
UCX_IB_FORK_INIT=try
UCX_IB_ASYNC_EVENTS=y
UCX_IB_ETH_PAUSE_ON=y
UCX_IB_ODP_NUMA_POLICY=preferred
UCX_IB_ODP_PREFETCH=n
UCX_IB_ODP_MAX_SIZE=auto
UCX_IB_DEVICE_SPECS=
UCX_IB_PREFER_NEAREST_DEVICE=y
UCX_IB_INDIRECT_ATOMIC=y
UCX_IB_GID_INDEX=auto
UCX_IB_SUBNET_PREFIX=
UCX_IB_GPU_DIRECT_RDMA=try
UCX_IB_PCI_BW=
UCX_IB_MLX5_DEVX=n
UCX_IB_MLX5_DEVX_OBJECTS=rcqp,rcsrq,dct,dcsrq,dci
UCX_IB_REG_MT_THRESH=4G
UCX_IB_REG_MT_CHUNK=2G
UCX_IB_REG_MT_BIND=n
UCX_IB_PCI_RELAXED_ORDERING=auto
UCX_RC_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_RC_VERBS_FAILURE=DIAG
UCX_RC_VERBS_MAX_NUM_EPS=256
UCX_RC_VERBS_SEG_SIZE=8256
UCX_RC_VERBS_TX_QUEUE_LEN=256
UCX_RC_VERBS_TX_MAX_BATCH=16
UCX_RC_VERBS_TX_MAX_POLL=16
UCX_RC_VERBS_TX_MIN_INLINE=64
UCX_RC_VERBS_TX_INLINE_RESP=64
UCX_RC_VERBS_TX_MIN_SGE=4
UCX_RC_VERBS_TX_MAX_BUFS=-1
UCX_RC_VERBS_TX_BUFS_GROW=1024
UCX_RC_VERBS_RX_QUEUE_LEN=4095
UCX_RC_VERBS_RX_MAX_BATCH=16
UCX_RC_VERBS_RX_MAX_POLL=16
UCX_RC_VERBS_RX_INLINE=64
UCX_RC_VERBS_RX_MAX_BUFS=-1
UCX_RC_VERBS_RX_BUFS_GROW=0
UCX_RC_VERBS_ADDR_TYPE=auto
UCX_RC_VERBS_IS_GLOBAL=n
UCX_RC_VERBS_SL=auto
UCX_RC_VERBS_TRAFFIC_CLASS=auto
UCX_RC_VERBS_HOP_LIMIT=255
UCX_RC_VERBS_NUM_PATHS=auto
UCX_RC_VERBS_ROCE_LOCAL_SUBNET=n
UCX_RC_VERBS_ROCE_PATH_FACTOR=1
UCX_RC_VERBS_LID_PATH_BITS=0
UCX_RC_VERBS_PKEY=auto
UCX_RC_VERBS_PATH_MTU=default
UCX_RC_VERBS_MAX_RD_ATOMIC=4
UCX_RC_VERBS_TIMEOUT=1000000.00us
UCX_RC_VERBS_RETRY_COUNT=7
UCX_RC_VERBS_RNR_TIMEOUT=1000.00us
UCX_RC_VERBS_RNR_RETRY_COUNT=7
UCX_RC_VERBS_FC_ENABLE=y
UCX_RC_VERBS_FC_WND_SIZE=512
UCX_RC_VERBS_FC_HARD_THRESH=0.250
UCX_RC_VERBS_FENCE=auto
UCX_RC_VERBS_MAX_GET_ZCOPY=auto
UCX_RC_VERBS_TX_NUM_GET_BYTES=inf
UCX_RC_VERBS_TX_POLL_ALWAYS=n
UCX_RC_VERBS_FC_SOFT_THRESH=0.500
UCX_RC_VERBS_TX_CQ_MODERATION=64
UCX_RC_VERBS_TX_CQ_LEN=4096
UCX_RC_VERBS_MAX_AM_HDR=128
UCX_RC_VERBS_TX_MAX_WR=inf
UCX_RC_VERBS_FLUSH_MODE=auto
UCX_RC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_RC_MLX5_FAILURE=DIAG
UCX_RC_MLX5_MAX_NUM_EPS=256
UCX_RC_MLX5_SEG_SIZE=8256
UCX_RC_MLX5_TX_QUEUE_LEN=256
UCX_RC_MLX5_TX_MAX_BATCH=16
UCX_RC_MLX5_TX_MAX_POLL=16
UCX_RC_MLX5_TX_MIN_INLINE=64
UCX_RC_MLX5_TX_INLINE_RESP=64
UCX_RC_MLX5_TX_MIN_SGE=4
UCX_RC_MLX5_TX_MAX_BUFS=-1
UCX_RC_MLX5_TX_BUFS_GROW=1024
UCX_RC_MLX5_RX_QUEUE_LEN=4095
UCX_RC_MLX5_RX_MAX_BATCH=16
UCX_RC_MLX5_RX_MAX_POLL=16
UCX_RC_MLX5_RX_INLINE=64
UCX_RC_MLX5_RX_MAX_BUFS=-1
UCX_RC_MLX5_RX_BUFS_GROW=0
UCX_RC_MLX5_ADDR_TYPE=auto
UCX_RC_MLX5_IS_GLOBAL=n
UCX_RC_MLX5_SL=auto
UCX_RC_MLX5_TRAFFIC_CLASS=auto
UCX_RC_MLX5_HOP_LIMIT=255
UCX_RC_MLX5_NUM_PATHS=auto
UCX_RC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_RC_MLX5_ROCE_PATH_FACTOR=1
UCX_RC_MLX5_LID_PATH_BITS=0
UCX_RC_MLX5_PKEY=auto
UCX_RC_MLX5_PATH_MTU=default
UCX_RC_MLX5_MAX_RD_ATOMIC=4
UCX_RC_MLX5_TIMEOUT=1000000.00us
UCX_RC_MLX5_RETRY_COUNT=7
UCX_RC_MLX5_RNR_TIMEOUT=1000.00us
UCX_RC_MLX5_RNR_RETRY_COUNT=7
UCX_RC_MLX5_FC_ENABLE=y
UCX_RC_MLX5_FC_WND_SIZE=512
UCX_RC_MLX5_FC_HARD_THRESH=0.250
UCX_RC_MLX5_FENCE=auto
UCX_RC_MLX5_MAX_GET_ZCOPY=auto
UCX_RC_MLX5_TX_NUM_GET_BYTES=inf
UCX_RC_MLX5_TX_POLL_ALWAYS=n
UCX_RC_MLX5_FC_SOFT_THRESH=0.500
UCX_RC_MLX5_TX_CQ_MODERATION=64
UCX_RC_MLX5_TX_CQ_LEN=4096
UCX_RC_MLX5_DM_SIZE=2K
UCX_RC_MLX5_DM_COUNT=1
UCX_RC_MLX5_MMIO_MODE=auto
UCX_RC_MLX5_AR_ENABLE=auto
UCX_RC_MLX5_TX_MAX_BB=inf
UCX_RC_MLX5_TM_ENABLE=n
UCX_RC_MLX5_TM_LIST_SIZE=1024
UCX_RC_MLX5_TM_SEG_SIZE=48K
UCX_RC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_RC_MLX5_TM_MP_NUM_STRIDES=8
UCX_RC_MLX5_EXP_BACKOFF=0
UCX_RC_MLX5_SRQ_TOPO=cyclic,cyclic_emulated
UCX_DC_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_DC_MLX5_FAILURE=DIAG
UCX_DC_MLX5_MAX_NUM_EPS=inf
UCX_DC_MLX5_SEG_SIZE=8256
UCX_DC_MLX5_TX_QUEUE_LEN=128
UCX_DC_MLX5_TX_MAX_BATCH=16
UCX_DC_MLX5_TX_MAX_POLL=16
UCX_DC_MLX5_TX_MIN_INLINE=64
UCX_DC_MLX5_TX_INLINE_RESP=64
UCX_DC_MLX5_TX_MIN_SGE=4
UCX_DC_MLX5_TX_MAX_BUFS=-1
UCX_DC_MLX5_TX_BUFS_GROW=1024
UCX_DC_MLX5_RX_QUEUE_LEN=4095
UCX_DC_MLX5_RX_MAX_BATCH=16
UCX_DC_MLX5_RX_MAX_POLL=16
UCX_DC_MLX5_RX_INLINE=64
UCX_DC_MLX5_RX_MAX_BUFS=-1
UCX_DC_MLX5_RX_BUFS_GROW=0
UCX_DC_MLX5_ADDR_TYPE=auto
UCX_DC_MLX5_IS_GLOBAL=n
UCX_DC_MLX5_SL=auto
UCX_DC_MLX5_TRAFFIC_CLASS=auto
UCX_DC_MLX5_HOP_LIMIT=255
UCX_DC_MLX5_NUM_PATHS=auto
UCX_DC_MLX5_ROCE_LOCAL_SUBNET=n
UCX_DC_MLX5_ROCE_PATH_FACTOR=1
UCX_DC_MLX5_LID_PATH_BITS=0
UCX_DC_MLX5_PKEY=auto
UCX_DC_MLX5_PATH_MTU=default
UCX_DC_MLX5_MAX_RD_ATOMIC=4
UCX_DC_MLX5_TIMEOUT=1000000.00us
UCX_DC_MLX5_RETRY_COUNT=7
UCX_DC_MLX5_RNR_TIMEOUT=1000.00us
UCX_DC_MLX5_RNR_RETRY_COUNT=7
UCX_DC_MLX5_FC_ENABLE=y
UCX_DC_MLX5_FC_WND_SIZE=512
UCX_DC_MLX5_FC_HARD_THRESH=0.250
UCX_DC_MLX5_FENCE=auto
UCX_DC_MLX5_MAX_GET_ZCOPY=auto
UCX_DC_MLX5_TX_NUM_GET_BYTES=inf
UCX_DC_MLX5_TX_POLL_ALWAYS=n
UCX_DC_MLX5_DM_SIZE=2K
UCX_DC_MLX5_DM_COUNT=1
UCX_DC_MLX5_MMIO_MODE=auto
UCX_DC_MLX5_AR_ENABLE=auto
UCX_DC_MLX5_TX_MAX_BB=inf
UCX_DC_MLX5_TM_ENABLE=n
UCX_DC_MLX5_TM_LIST_SIZE=1024
UCX_DC_MLX5_TM_SEG_SIZE=48K
UCX_DC_MLX5_TM_MP_SRQ_ENABLE=try
UCX_DC_MLX5_TM_MP_NUM_STRIDES=8
UCX_DC_MLX5_EXP_BACKOFF=0
UCX_DC_MLX5_SRQ_TOPO=list
UCX_DC_MLX5_RX_QUEUE_LEN_INIT=128
UCX_DC_MLX5_NUM_DCI=8
UCX_DC_MLX5_TX_POLICY=dcs_quota
UCX_DC_MLX5_DCI_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCI_KA_FULL_HANDSHAKE=n
UCX_DC_MLX5_DCT_FULL_HANDSHAKE=n
UCX_DC_MLX5_RAND_DCI_SEED=0
UCX_DC_MLX5_QUOTA=32
UCX_DC_MLX5_FC_HARD_REQ_TIMEOUT=5000000.00us
UCX_DC_MLX5_COMPACT_AV=y
UCX_UD_VERBS_ALLOC=huge,thp,md,mmap,heap
UCX_UD_VERBS_FAILURE=DIAG
UCX_UD_VERBS_MAX_NUM_EPS=inf
UCX_UD_VERBS_SEG_SIZE=8K
UCX_UD_VERBS_TX_QUEUE_LEN=256
UCX_UD_VERBS_TX_MAX_BATCH=16
UCX_UD_VERBS_TX_MAX_POLL=16
UCX_UD_VERBS_TX_MIN_INLINE=64
UCX_UD_VERBS_TX_INLINE_RESP=0
UCX_UD_VERBS_TX_MIN_SGE=4
UCX_UD_VERBS_TX_MAX_BUFS=-1
UCX_UD_VERBS_TX_BUFS_GROW=1024
UCX_UD_VERBS_RX_QUEUE_LEN=4096
UCX_UD_VERBS_RX_MAX_BATCH=16
UCX_UD_VERBS_RX_MAX_POLL=16
UCX_UD_VERBS_RX_INLINE=0
UCX_UD_VERBS_RX_MAX_BUFS=-1
UCX_UD_VERBS_RX_BUFS_GROW=0
UCX_UD_VERBS_ADDR_TYPE=auto
UCX_UD_VERBS_IS_GLOBAL=n
UCX_UD_VERBS_SL=auto
UCX_UD_VERBS_TRAFFIC_CLASS=auto
UCX_UD_VERBS_HOP_LIMIT=255
UCX_UD_VERBS_NUM_PATHS=auto
UCX_UD_VERBS_ROCE_LOCAL_SUBNET=n
UCX_UD_VERBS_ROCE_PATH_FACTOR=1
UCX_UD_VERBS_LID_PATH_BITS=0
UCX_UD_VERBS_PKEY=auto
UCX_UD_VERBS_PATH_MTU=default
UCX_UD_VERBS_RX_QUEUE_LEN_INIT=128
UCX_UD_VERBS_TIMEOUT=300000000.00us
UCX_UD_VERBS_TIMER_TICK=10000.00us
UCX_UD_VERBS_TIMER_BACKOFF=2.000
UCX_UD_VERBS_ASYNC_TIMER_TICK=100000.00us
UCX_UD_VERBS_MIN_POKE_TIME=250000.00us
UCX_UD_VERBS_ETH_DGID_CHECK=y
UCX_UD_VERBS_MAX_WINDOW=1025
UCX_UD_VERBS_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_ALLOC=huge,thp,md,mmap,heap
UCX_UD_MLX5_FAILURE=DIAG
UCX_UD_MLX5_MAX_NUM_EPS=inf
UCX_UD_MLX5_SEG_SIZE=8K
UCX_UD_MLX5_TX_QUEUE_LEN=256
UCX_UD_MLX5_TX_MAX_BATCH=16
UCX_UD_MLX5_TX_MAX_POLL=16
UCX_UD_MLX5_TX_MIN_INLINE=64
UCX_UD_MLX5_TX_INLINE_RESP=0
UCX_UD_MLX5_TX_MIN_SGE=4
UCX_UD_MLX5_TX_MAX_BUFS=-1
UCX_UD_MLX5_TX_BUFS_GROW=1024
UCX_UD_MLX5_RX_QUEUE_LEN=4096
UCX_UD_MLX5_RX_MAX_BATCH=16
UCX_UD_MLX5_RX_MAX_POLL=16
UCX_UD_MLX5_RX_INLINE=0
UCX_UD_MLX5_RX_MAX_BUFS=-1
UCX_UD_MLX5_RX_BUFS_GROW=0
UCX_UD_MLX5_ADDR_TYPE=auto
UCX_UD_MLX5_IS_GLOBAL=n
UCX_UD_MLX5_SL=auto
UCX_UD_MLX5_TRAFFIC_CLASS=auto
UCX_UD_MLX5_HOP_LIMIT=255
UCX_UD_MLX5_NUM_PATHS=auto
UCX_UD_MLX5_ROCE_LOCAL_SUBNET=n
UCX_UD_MLX5_ROCE_PATH_FACTOR=1
UCX_UD_MLX5_LID_PATH_BITS=0
UCX_UD_MLX5_PKEY=auto
UCX_UD_MLX5_PATH_MTU=default
UCX_UD_MLX5_RX_QUEUE_LEN_INIT=128
UCX_UD_MLX5_TIMEOUT=300000000.00us
UCX_UD_MLX5_TIMER_TICK=10000.00us
UCX_UD_MLX5_TIMER_BACKOFF=2.000
UCX_UD_MLX5_ASYNC_TIMER_TICK=100000.00us
UCX_UD_MLX5_MIN_POKE_TIME=250000.00us
UCX_UD_MLX5_ETH_DGID_CHECK=y
UCX_UD_MLX5_MAX_WINDOW=1025
UCX_UD_MLX5_RX_ASYNC_MAX_POLL=64
UCX_UD_MLX5_DM_SIZE=2K
UCX_UD_MLX5_DM_COUNT=1
UCX_UD_MLX5_MMIO_MODE=auto
UCX_UD_MLX5_AR_ENABLE=auto
UCX_UD_MLX5_COMPACT_AV=y
UCX_RDMA_CM_FAILURE=DIAG
UCX_RDMA_CM_REUSEADDR=n
UCX_RDMA_CM_SOURCE_ADDRESS=
UCX_RDMA_CM_TIMEOUT=10000000.00us
UCX_RDMA_CM_RESERVED_QPN=try
UCX_CMA_ALLOC=huge,thp,mmap,heap
UCX_CMA_FAILURE=DIAG
UCX_CMA_MAX_NUM_EPS=inf
UCX_CMA_BW=11145.00MBps
UCX_CMA_MAX_IOV=16
UCX_CMA_SEG_SIZE=512K
UCX_CMA_TX_QUOTA=1
UCX_CMA_TX_MAX_BUFS=-1
UCX_CMA_TX_BUFS_GROW=8
@Artemy-Mellanox
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ofed_info
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: execve(): ofed_info: No such file or directory
srun: error: node001: task 0: Exited with exit code 2
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: execve(): ofed_info: No such file or directory
srun: error: node002: task 1: Exited with exit code 2
[root@bright88 mxnet]#
@karanveersingh5623 I need a bit more info. Could you please run the following commands and attach the output?
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area gdb -ex r -ex "info sharedlibrary" -ex q --args "$(which ucx_info)" -c
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area lspci
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibstat
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ls -la /sys/class/infiniband
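The five commands above differ only in the final diagnostic. Note also that `"$(which ucx_info)"` in double quotes is expanded by the host shell before `srun` starts, so it looks up `ucx_info` on the host rather than in the container. A minimal sketch of a wrapper follows: it hands each diagnostic to `bash -c` in single quotes, so the substitution is evaluated inside the container. The helper names `in_container` and `run_all` and the `DRY_RUN` switch are hypothetical, and the sketch assumes `bash` and the listed tools are present in the image.

```shell
#!/bin/sh
# Sketch: run each diagnostic through `bash -c` inside the container.

# Run one command string inside the container (srun options copied from the
# commands above). With DRY_RUN=1 (the default here, an assumption of this
# sketch) it only prints the string that would be handed to `bash -c`.
in_container() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'would run in container: %s\n' "$1"
    return 0
  fi
  srun \
    --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" \
    --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none \
    --ntasks-per-node=1 --exclusive \
    --container-image="192.168.61.4:5000#/cosmoflow-nvidia:0.4" \
    --container-name=mlperf-hpc-cosmoflow \
    --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area \
    bash -c "$1"
}

run_all() {
  # Single quotes keep $(which ucx_info) literal on the host, so the lookup
  # happens in the container's PATH when bash -c evaluates the string.
  in_container 'gdb -ex r -ex "info sharedlibrary" -ex q --args "$(which ucx_info)" -c'
  in_container 'ibv_devinfo'
  in_container 'lspci'
  in_container 'ibstat'
  in_container 'ls -la /sys/class/infiniband'
}

run_all
```

By default this only prints what would be run; setting `DRY_RUN=0` in the environment submits the five steps for real.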
@Artemy-Mellanox
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area gdb -ex r -ex "info sharedlibrary" -ex q --args "$(which ucx_info)" -c
/usr/bin/which: no ucx_info in (/cm/shared/apps/slurm/current/sbin:/cm/shared/apps/slurm/current/bin:/cm/local/apps/cm-setup/bin:/cm/local/apps/cluster-tools/bin:/cm/local/apps/cmd/sbin:/cm/local/apps/cmd/bin:/cm/local/apps/environment-modules/4.5.3//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/cm/local/apps/environment-modules/4.5.3/bin:/bin:/sbin:/opt/dell/srvadmin/bin:/opt/dell/srvadmin/sbin:/root/bin)
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: execve(): gdb: No such file or directory
srun: error: node001: task 0: Exited with exit code 2
[root@bright88 mxnet]#
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
No IB devices found
srun: error: node001: task 0: Exited with exit code 255
[root@bright88 mxnet]#
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibstat
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.1014
Hardware version: 0
Node GUID: 0x043f720300dc0684
System image GUID: 0x043f720300dc0684
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x063f72fffedc0684
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.1014
Hardware version: 0
Node GUID: 0x043f720300dc0685
System image GUID: 0x043f720300dc0684
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x063f72fffedc0685
Link layer: Ethernet
[root@bright88 mxnet]#
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ls -la /sys/class/infiniband
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
total 0
drwxr-xr-x 2 root root 0 Nov 15 18:30 .
drwxr-xr-x 91 root root 0 Nov 15 12:34 ..
lrwxrwxrwx 1 root root 0 Nov 15 18:30 mlx5_0 -> ../../devices/pci0000:97/0000:97:02.0/0000:98:00.0/infiniband/mlx5_0
lrwxrwxrwx 1 root root 0 Nov 15 18:30 mlx5_1 -> ../../devices/pci0000:97/0000:97:02.0/0000:98:00.1/infiniband/mlx5_1
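In the output above, `ibstat` and `/sys/class/infiniband` both show `mlx5_0` and `mlx5_1`, while `ibv_devinfo` reports "No IB devices found". One common cause of that combination, not confirmed in this thread, is that the `/dev/infiniband/uverbs*` device nodes used by libibverbs are not visible inside the container. A minimal sketch of a check to run inside the container follows; the function name `check_uverbs` is hypothetical.

```shell
#!/bin/sh
# Report whether a directory contains uverbs device nodes.
# $1 is the directory to inspect (defaults to /dev/infiniband).
check_uverbs() {
  dir=${1:-/dev/infiniband}
  for node in "$dir"/uverbs*; do
    # If the glob matched nothing, $node is the literal pattern and -e fails.
    if [ -e "$node" ]; then
      echo "found: $node"
      return 0
    fi
  done
  echo "no uverbs nodes in $dir"
  return 1
}

check_uverbs /dev/infiniband || true
```

If no nodes are reported, adding `/dev/infiniband` to the container mounts would be one thing to try; if nodes are present, the cause lies elsewhere (for example, the verbs provider library inside the image).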
@Artemy-Mellanox
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area lspci
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
00:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
00:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
00:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
00:00.4 Host bridge: Intel Corporation Device 0998
00:02.0 System peripheral: Intel Corporation Device 09a6
00:02.1 System peripheral: Intel Corporation Device 09a7
00:02.4 Non-Essential Instrumentation [1300]: Intel Corporation Device 3456 (rev 01)
00:11.0 Unassigned class [ff00]: Intel Corporation C620 Series Chipset Family MROM 0 (rev 0a)
00:11.5 SATA controller: Intel Corporation C620 Series Chipset Family SSATA Controller [AHCI mode] (rev 0a)
00:14.0 USB controller: Intel Corporation C620 Series Chipset Family USB 3.0 xHCI Controller (rev 0a)
00:14.2 Signal processing controller: Intel Corporation C620 Series Chipset Family Thermal Subsystem (rev 0a)
00:16.0 Communication controller: Intel Corporation C620 Series Chipset Family MEI Controller #1 (rev 0a)
00:16.1 Communication controller: Intel Corporation C620 Series Chipset Family MEI Controller #2 (rev 0a)
00:16.4 Communication controller: Intel Corporation C620 Series Chipset Family MEI Controller #3 (rev 0a)
00:17.0 SATA controller: Intel Corporation C620 Series Chipset Family SATA Controller [AHCI mode] (rev 0a)
00:1c.0 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #1 (rev fa)
00:1c.4 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #5 (rev fa)
00:1c.5 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #6 (rev fa)
00:1d.0 PCI bridge: Intel Corporation C620 Series Chipset Family PCI Express Root Port #9 (rev fa)
00:1f.0 ISA bridge: Intel Corporation Device a1cb (rev 0a)
00:1f.2 Memory controller: Intel Corporation C620 Series Chipset Family Power Management Controller (rev 0a)
00:1f.4 SMBus: Intel Corporation C620 Series Chipset Family SMBus (rev 0a)
00:1f.5 Serial bus controller [0c80]: Intel Corporation C620 Series Chipset Family SPI Controller (rev 0a)
02:00.0 PCI bridge: PLDA PCI Express Bridge (rev 02)
03:00.0 VGA compatible controller: Matrox Electronics Systems Ltd. Integrated Matrox G200eW3 Graphics Controller (rev 04)
04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 2-port Gigabit Ethernet PCIe
05:00.0 SATA controller: Marvell Technology Group Ltd. 88SE9230 PCIe SATA 6Gb/s Controller (rev 11)
16:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
16:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
16:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
16:00.4 Host bridge: Intel Corporation Device 0998
16:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
17:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
30:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
30:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
30:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
30:00.4 Host bridge: Intel Corporation Device 0998
30:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
30:03.0 PCI bridge: Intel Corporation Device 347b (rev 04)
30:04.0 PCI bridge: Intel Corporation Device 347c (rev 04)
31:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd Device a80a
33:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
33:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
33:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
33:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
4a:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
4a:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
4a:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
4a:00.4 Host bridge: Intel Corporation Device 0998
64:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
64:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
64:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
64:00.4 Host bridge: Intel Corporation Device 0998
64:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
65:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
7e:00.0 System peripheral: Intel Corporation Device 3450
7e:00.1 System peripheral: Intel Corporation Device 3451
7e:00.2 System peripheral: Intel Corporation Device 3452
7e:00.3 Host bridge: Intel Corporation Device 0998
7e:00.5 System peripheral: Intel Corporation Device 3455
7e:02.0 System peripheral: Intel Corporation Device 3440
7e:02.1 System peripheral: Intel Corporation Device 3441
7e:02.2 System peripheral: Intel Corporation Device 3442
7e:03.0 System peripheral: Intel Corporation Device 3440
7e:03.1 System peripheral: Intel Corporation Device 3441
7e:03.2 System peripheral: Intel Corporation Device 3442
7e:04.0 System peripheral: Intel Corporation Device 3440
7e:04.1 System peripheral: Intel Corporation Device 3441
7e:04.2 System peripheral: Intel Corporation Device 3442
7e:04.3 System peripheral: Intel Corporation Device 3443
7e:05.0 System peripheral: Intel Corporation Device 3445
7e:05.1 System peripheral: Intel Corporation Device 3446
7e:05.2 System peripheral: Intel Corporation Device 3447
7e:06.0 System peripheral: Intel Corporation Device 3445
7e:06.1 System peripheral: Intel Corporation Device 3446
7e:06.2 System peripheral: Intel Corporation Device 3447
7e:07.0 System peripheral: Intel Corporation Device 3445
7e:07.1 System peripheral: Intel Corporation Device 3446
7e:07.2 System peripheral: Intel Corporation Device 3447
7e:0b.0 System peripheral: Intel Corporation Device 3448
7e:0b.1 System peripheral: Intel Corporation Device 3448
7e:0b.2 System peripheral: Intel Corporation Device 344b
7e:0c.0 Performance counters: Intel Corporation Device 344a
7e:0d.0 Performance counters: Intel Corporation Device 344a
7e:0e.0 Performance counters: Intel Corporation Device 344a
7e:0f.0 Performance counters: Intel Corporation Device 344a
7e:1a.0 Performance counters: Intel Corporation Device 2880
7e:1b.0 Performance counters: Intel Corporation Device 2880
7e:1c.0 Performance counters: Intel Corporation Device 2880
7e:1d.0 Performance counters: Intel Corporation Device 2880
7f:00.0 System peripheral: Intel Corporation Device 344c
7f:00.1 System peripheral: Intel Corporation Device 344c
7f:00.2 System peripheral: Intel Corporation Device 344c
7f:00.3 System peripheral: Intel Corporation Device 344c
7f:00.4 System peripheral: Intel Corporation Device 344c
7f:00.5 System peripheral: Intel Corporation Device 344c
7f:00.6 System peripheral: Intel Corporation Device 344c
7f:00.7 System peripheral: Intel Corporation Device 344c
7f:01.0 System peripheral: Intel Corporation Device 344c
7f:01.1 System peripheral: Intel Corporation Device 344c
7f:01.2 System peripheral: Intel Corporation Device 344c
7f:01.3 System peripheral: Intel Corporation Device 344c
7f:01.4 System peripheral: Intel Corporation Device 344c
7f:01.5 System peripheral: Intel Corporation Device 344c
7f:01.6 System peripheral: Intel Corporation Device 344c
7f:01.7 System peripheral: Intel Corporation Device 344c
7f:02.0 System peripheral: Intel Corporation Device 344c
7f:02.1 System peripheral: Intel Corporation Device 344c
7f:02.2 System peripheral: Intel Corporation Device 344c
7f:02.3 System peripheral: Intel Corporation Device 344c
7f:02.4 System peripheral: Intel Corporation Device 344c
7f:02.5 System peripheral: Intel Corporation Device 344c
7f:02.6 System peripheral: Intel Corporation Device 344c
7f:02.7 System peripheral: Intel Corporation Device 344c
7f:03.0 System peripheral: Intel Corporation Device 344c
7f:03.1 System peripheral: Intel Corporation Device 344c
7f:03.2 System peripheral: Intel Corporation Device 344c
7f:03.3 System peripheral: Intel Corporation Device 344c
7f:03.4 System peripheral: Intel Corporation Device 344c
7f:03.5 System peripheral: Intel Corporation Device 344c
7f:03.6 System peripheral: Intel Corporation Device 344c
7f:03.7 System peripheral: Intel Corporation Device 344c
7f:04.0 System peripheral: Intel Corporation Device 344c
7f:04.1 System peripheral: Intel Corporation Device 344c
7f:04.2 System peripheral: Intel Corporation Device 344c
7f:04.3 System peripheral: Intel Corporation Device 344c
7f:04.4 System peripheral: Intel Corporation Device 344c
7f:04.5 System peripheral: Intel Corporation Device 344c
7f:04.6 System peripheral: Intel Corporation Device 344c
7f:04.7 System peripheral: Intel Corporation Device 344c
7f:0a.0 System peripheral: Intel Corporation Device 344d
7f:0a.1 System peripheral: Intel Corporation Device 344d
7f:0a.2 System peripheral: Intel Corporation Device 344d
7f:0a.3 System peripheral: Intel Corporation Device 344d
7f:0a.4 System peripheral: Intel Corporation Device 344d
7f:0a.5 System peripheral: Intel Corporation Device 344d
7f:0a.6 System peripheral: Intel Corporation Device 344d
7f:0a.7 System peripheral: Intel Corporation Device 344d
7f:0b.0 System peripheral: Intel Corporation Device 344d
7f:0b.1 System peripheral: Intel Corporation Device 344d
7f:0b.2 System peripheral: Intel Corporation Device 344d
7f:0b.3 System peripheral: Intel Corporation Device 344d
7f:0b.4 System peripheral: Intel Corporation Device 344d
7f:0b.5 System peripheral: Intel Corporation Device 344d
7f:0b.6 System peripheral: Intel Corporation Device 344d
7f:0b.7 System peripheral: Intel Corporation Device 344d
7f:0c.0 System peripheral: Intel Corporation Device 344d
7f:0c.1 System peripheral: Intel Corporation Device 344d
7f:0c.2 System peripheral: Intel Corporation Device 344d
7f:0c.3 System peripheral: Intel Corporation Device 344d
7f:0c.4 System peripheral: Intel Corporation Device 344d
7f:0c.5 System peripheral: Intel Corporation Device 344d
7f:0c.6 System peripheral: Intel Corporation Device 344d
7f:0c.7 System peripheral: Intel Corporation Device 344d
7f:0d.0 System peripheral: Intel Corporation Device 344d
7f:0d.1 System peripheral: Intel Corporation Device 344d
7f:0d.2 System peripheral: Intel Corporation Device 344d
7f:0d.3 System peripheral: Intel Corporation Device 344d
7f:0d.4 System peripheral: Intel Corporation Device 344d
7f:0d.5 System peripheral: Intel Corporation Device 344d
7f:0d.6 System peripheral: Intel Corporation Device 344d
7f:0d.7 System peripheral: Intel Corporation Device 344d
7f:0e.0 System peripheral: Intel Corporation Device 344d
7f:0e.1 System peripheral: Intel Corporation Device 344d
7f:0e.2 System peripheral: Intel Corporation Device 344d
7f:0e.3 System peripheral: Intel Corporation Device 344d
7f:0e.4 System peripheral: Intel Corporation Device 344d
7f:0e.5 System peripheral: Intel Corporation Device 344d
7f:0e.6 System peripheral: Intel Corporation Device 344d
7f:0e.7 System peripheral: Intel Corporation Device 344d
7f:1d.0 System peripheral: Intel Corporation Device 344f
7f:1d.1 System peripheral: Intel Corporation Device 3457
7f:1e.0 System peripheral: Intel Corporation Device 3458 (rev 06)
7f:1e.1 System peripheral: Intel Corporation Device 3459 (rev 06)
7f:1e.2 System peripheral: Intel Corporation Device 345a (rev 06)
7f:1e.3 System peripheral: Intel Corporation Device 345b (rev 06)
7f:1e.4 System peripheral: Intel Corporation Device 345c (rev 06)
7f:1e.5 System peripheral: Intel Corporation Device 345d (rev 06)
7f:1e.6 System peripheral: Intel Corporation Device 345e (rev 06)
7f:1e.7 System peripheral: Intel Corporation Device 345f (rev 06)
80:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
80:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
80:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
80:00.4 Host bridge: Intel Corporation Device 0998
80:02.0 System peripheral: Intel Corporation Device 09a6
80:02.1 System peripheral: Intel Corporation Device 09a7
80:02.4 Non-Essential Instrumentation [1300]: Intel Corporation Device 3456 (rev 01)
97:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
97:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
97:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
97:00.4 Host bridge: Intel Corporation Device 0998
97:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
98:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
98:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
b0:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
b0:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
b0:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
b0:00.4 Host bridge: Intel Corporation Device 0998
c9:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
c9:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
c9:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
c9:00.4 Host bridge: Intel Corporation Device 0998
c9:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
ca:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
e2:00.0 System peripheral: Intel Corporation Device 09a2 (rev 04)
e2:00.1 System peripheral: Intel Corporation Device 09a4 (rev 04)
e2:00.2 System peripheral: Intel Corporation Device 09a3 (rev 04)
e2:00.4 Host bridge: Intel Corporation Device 0998
e2:02.0 PCI bridge: Intel Corporation Device 347a (rev 04)
e3:00.0 3D controller: NVIDIA Corporation Device 20b5 (rev a1)
fe:00.0 System peripheral: Intel Corporation Device 3450
fe:00.1 System peripheral: Intel Corporation Device 3451
fe:00.2 System peripheral: Intel Corporation Device 3452
fe:00.3 Host bridge: Intel Corporation Device 0998
fe:00.5 System peripheral: Intel Corporation Device 3455
fe:02.0 System peripheral: Intel Corporation Device 3440
fe:02.1 System peripheral: Intel Corporation Device 3441
fe:02.2 System peripheral: Intel Corporation Device 3442
fe:03.0 System peripheral: Intel Corporation Device 3440
fe:03.1 System peripheral: Intel Corporation Device 3441
fe:03.2 System peripheral: Intel Corporation Device 3442
fe:04.0 System peripheral: Intel Corporation Device 3440
fe:04.1 System peripheral: Intel Corporation Device 3441
fe:04.2 System peripheral: Intel Corporation Device 3442
fe:04.3 System peripheral: Intel Corporation Device 3443
fe:05.0 System peripheral: Intel Corporation Device 3445
fe:05.1 System peripheral: Intel Corporation Device 3446
fe:05.2 System peripheral: Intel Corporation Device 3447
fe:06.0 System peripheral: Intel Corporation Device 3445
fe:06.1 System peripheral: Intel Corporation Device 3446
fe:06.2 System peripheral: Intel Corporation Device 3447
fe:07.0 System peripheral: Intel Corporation Device 3445
fe:07.1 System peripheral: Intel Corporation Device 3446
fe:07.2 System peripheral: Intel Corporation Device 3447
fe:0b.0 System peripheral: Intel Corporation Device 3448
fe:0b.1 System peripheral: Intel Corporation Device 3448
fe:0b.2 System peripheral: Intel Corporation Device 344b
fe:0c.0 Performance counters: Intel Corporation Device 344a
fe:0d.0 Performance counters: Intel Corporation Device 344a
fe:0e.0 Performance counters: Intel Corporation Device 344a
fe:0f.0 Performance counters: Intel Corporation Device 344a
fe:1a.0 Performance counters: Intel Corporation Device 2880
fe:1b.0 Performance counters: Intel Corporation Device 2880
fe:1c.0 Performance counters: Intel Corporation Device 2880
fe:1d.0 Performance counters: Intel Corporation Device 2880
ff:00.0 System peripheral: Intel Corporation Device 344c
ff:00.1 System peripheral: Intel Corporation Device 344c
ff:00.2 System peripheral: Intel Corporation Device 344c
ff:00.3 System peripheral: Intel Corporation Device 344c
ff:00.4 System peripheral: Intel Corporation Device 344c
ff:00.5 System peripheral: Intel Corporation Device 344c
ff:00.6 System peripheral: Intel Corporation Device 344c
ff:00.7 System peripheral: Intel Corporation Device 344c
ff:01.0 System peripheral: Intel Corporation Device 344c
ff:01.1 System peripheral: Intel Corporation Device 344c
ff:01.2 System peripheral: Intel Corporation Device 344c
ff:01.3 System peripheral: Intel Corporation Device 344c
ff:01.4 System peripheral: Intel Corporation Device 344c
ff:01.5 System peripheral: Intel Corporation Device 344c
ff:01.6 System peripheral: Intel Corporation Device 344c
ff:01.7 System peripheral: Intel Corporation Device 344c
ff:02.0 System peripheral: Intel Corporation Device 344c
ff:02.1 System peripheral: Intel Corporation Device 344c
ff:02.2 System peripheral: Intel Corporation Device 344c
ff:02.3 System peripheral: Intel Corporation Device 344c
ff:02.4 System peripheral: Intel Corporation Device 344c
ff:02.5 System peripheral: Intel Corporation Device 344c
ff:02.6 System peripheral: Intel Corporation Device 344c
ff:02.7 System peripheral: Intel Corporation Device 344c
ff:03.0 System peripheral: Intel Corporation Device 344c
ff:03.1 System peripheral: Intel Corporation Device 344c
ff:03.2 System peripheral: Intel Corporation Device 344c
ff:03.3 System peripheral: Intel Corporation Device 344c
ff:03.4 System peripheral: Intel Corporation Device 344c
ff:03.5 System peripheral: Intel Corporation Device 344c
ff:03.6 System peripheral: Intel Corporation Device 344c
ff:03.7 System peripheral: Intel Corporation Device 344c
ff:04.0 System peripheral: Intel Corporation Device 344c
ff:04.1 System peripheral: Intel Corporation Device 344c
ff:04.2 System peripheral: Intel Corporation Device 344c
ff:04.3 System peripheral: Intel Corporation Device 344c
ff:04.4 System peripheral: Intel Corporation Device 344c
ff:04.5 System peripheral: Intel Corporation Device 344c
ff:04.6 System peripheral: Intel Corporation Device 344c
ff:04.7 System peripheral: Intel Corporation Device 344c
ff:0a.0 System peripheral: Intel Corporation Device 344d
ff:0a.1 System peripheral: Intel Corporation Device 344d
ff:0a.2 System peripheral: Intel Corporation Device 344d
ff:0a.3 System peripheral: Intel Corporation Device 344d
ff:0a.4 System peripheral: Intel Corporation Device 344d
ff:0a.5 System peripheral: Intel Corporation Device 344d
ff:0a.6 System peripheral: Intel Corporation Device 344d
ff:0a.7 System peripheral: Intel Corporation Device 344d
ff:0b.0 System peripheral: Intel Corporation Device 344d
ff:0b.1 System peripheral: Intel Corporation Device 344d
ff:0b.2 System peripheral: Intel Corporation Device 344d
ff:0b.3 System peripheral: Intel Corporation Device 344d
ff:0b.4 System peripheral: Intel Corporation Device 344d
ff:0b.5 System peripheral: Intel Corporation Device 344d
ff:0b.6 System peripheral: Intel Corporation Device 344d
ff:0b.7 System peripheral: Intel Corporation Device 344d
ff:0c.0 System peripheral: Intel Corporation Device 344d
ff:0c.1 System peripheral: Intel Corporation Device 344d
ff:0c.2 System peripheral: Intel Corporation Device 344d
ff:0c.3 System peripheral: Intel Corporation Device 344d
ff:0c.4 System peripheral: Intel Corporation Device 344d
ff:0c.5 System peripheral: Intel Corporation Device 344d
ff:0c.6 System peripheral: Intel Corporation Device 344d
ff:0c.7 System peripheral: Intel Corporation Device 344d
ff:0d.0 System peripheral: Intel Corporation Device 344d
ff:0d.1 System peripheral: Intel Corporation Device 344d
ff:0d.2 System peripheral: Intel Corporation Device 344d
ff:0d.3 System peripheral: Intel Corporation Device 344d
ff:0d.4 System peripheral: Intel Corporation Device 344d
ff:0d.5 System peripheral: Intel Corporation Device 344d
ff:0d.6 System peripheral: Intel Corporation Device 344d
ff:0d.7 System peripheral: Intel Corporation Device 344d
ff:0e.0 System peripheral: Intel Corporation Device 344d
ff:0e.1 System peripheral: Intel Corporation Device 344d
ff:0e.2 System peripheral: Intel Corporation Device 344d
ff:0e.3 System peripheral: Intel Corporation Device 344d
ff:0e.4 System peripheral: Intel Corporation Device 344d
ff:0e.5 System peripheral: Intel Corporation Device 344d
ff:0e.6 System peripheral: Intel Corporation Device 344d
ff:0e.7 System peripheral: Intel Corporation Device 344d
ff:1d.0 System peripheral: Intel Corporation Device 344f
ff:1d.1 System peripheral: Intel Corporation Device 3457
ff:1e.0 System peripheral: Intel Corporation Device 3458 (rev 06)
ff:1e.1 System peripheral: Intel Corporation Device 3459 (rev 06)
ff:1e.2 System peripheral: Intel Corporation Device 345a (rev 06)
ff:1e.3 System peripheral: Intel Corporation Device 345b (rev 06)
ff:1e.4 System peripheral: Intel Corporation Device 345c (rev 06)
ff:1e.5 System peripheral: Intel Corporation Device 345d (rev 06)
ff:1e.6 System peripheral: Intel Corporation Device 345e (rev 06)
ff:1e.7 System peripheral: Intel Corporation Device 345f (rev 06)
@karanveersingh5623 the setup has some strange problems. Could you please post the output of the following commands, run on the host and not inside the Docker container:
uname -a
lsmod
modinfo ib_uverbs
modinfo ib_umad
modinfo mlx5_ib
ibstat
ibv_devinfo
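To gather all of the above in one pass, something like the following sketch could be used. The function name `collect_host_info` and the output file argument are made up for illustration; only the seven commands themselves come from the request above. A command that is missing or fails is noted in the file instead of stopping the run.

```shell
# Hypothetical helper: run each requested diagnostic command on the host
# and append its output, under a header line, to a single file.
collect_host_info() {
    out="$1"
    : > "$out"   # start with an empty file
    for cmd in "uname -a" "lsmod" "modinfo ib_uverbs" "modinfo ib_umad" \
               "modinfo mlx5_ib" "ibstat" "ibv_devinfo"; do
        printf '===== %s =====\n' "$cmd" >> "$out"
        # $cmd is deliberately unquoted so that "modinfo ib_umad" splits into
        # a command and its argument; stderr is captured alongside stdout.
        $cmd >> "$out" 2>&1 || printf '(command failed or is not installed)\n' >> "$out"
    done
}
```

Run as root on the compute node itself, for example `collect_host_info /tmp/node001-host-info.txt`, then attach the resulting file.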
@Artemy-Mellanox
Please find the details from my compute node, node001:
[root@node001 ~]# uname -a
Linux node001 4.18.0-372.9.1.el8.x86_64 #1 SMP Tue May 10 14:48:47 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# lsmod
Module Size Used by
mgc 102400 1
lustre 1040384 7036
lmv 204800 2 lustre
mdc 278528 2 lustre
fid 36864 1 mdc
lov 344064 4693 mdc,lustre
fld 45056 2 lov,lmv
osc 454656 4692 mdc
ksocklnd 184320 1
ptlrpc 1425408 8 fld,osc,fid,mgc,lov,mdc,lmv,lustre
obdclass 3362816 4826 fld,osc,fid,ptlrpc,mgc,lov,mdc,lmv,lustre
lnet 704512 7 osc,obdclass,ptlrpc,mgc,ksocklnd,lmv,lustre
libcfs 266240 12 fld,lnet,osc,fid,obdclass,ptlrpc,mgc,ksocklnd,lov,mdc,lmv,lustre
xt_conntrack 16384 1
ipt_MASQUERADE 16384 1
nf_conntrack_netlink 49152 0
nft_counter 16384 15
xt_addrtype 16384 2
nft_compat 20480 4
nft_chain_nat 16384 4
nf_nat 45056 2 ipt_MASQUERADE,nft_chain_nat
nf_conntrack 172032 4 xt_conntrack,nf_nat,ipt_MASQUERADE,nf_conntrack_netlink
nf_defrag_ipv6 20480 1 nf_conntrack
nf_defrag_ipv4 16384 1 nf_conntrack
nf_tables 180224 43 nft_compat,nft_counter,nft_chain_nat
nfnetlink 16384 3 nft_compat,nf_conntrack_netlink,nf_tables
overlay 139264 0
dell_rbu 16384 0
nvidia_drm 69632 0
nvidia_modeset 1142784 1 nvidia_drm
nvidia_uvm 1298432 0
nvidia 40792064 163 nvidia_uvm,nvidia_modeset
intel_rapl_msr 16384 0
intel_rapl_common 24576 1 intel_rapl_msr
ipmi_ssif 36864 0
i10nm_edac 24576 0
nfit 65536 1 i10nm_edac
libnvdimm 196608 1 nfit
x86_pkg_temp_thermal 16384 0
intel_powerclamp 16384 0
coretemp 16384 0
kvm_intel 339968 0
iTCO_wdt 16384 0
kvm 905216 1 kvm_intel
irqbypass 16384 1 kvm
dell_smbios 24576 0
crc32_pclmul 16384 0
iTCO_vendor_support 16384 1 iTCO_wdt
dell_wmi_descriptor 16384 1 dell_smbios
wmi_bmof 16384 0
rapl 20480 0
mgag200 36864 0
dcdbas 16384 0
intel_cstate 20480 0
rpcrdma 282624 0
drm_kms_helper 266240 4 mgag200,nvidia_drm
intel_uncore 204800 0
pcspkr 16384 0
syscopyarea 16384 1 drm_kms_helper
sysfillrect 16384 1 drm_kms_helper
joydev 24576 0
sysimgblt 16384 1 drm_kms_helper
fb_sys_fops 16384 1 drm_kms_helper
isst_if_mbox_pci 16384 0
drm 585728 5 drm_kms_helper,nvidia,mgag200,nvidia_drm
isst_if_mmio 16384 0
isst_if_common 16384 2 isst_if_mmio,isst_if_mbox_pci
mei_me 45056 0
i2c_i801 28672 0
mei 118784 1 mei_me
acpi_ipmi 16384 0
intel_pmt 16384 0
wmi 32768 3 wmi_bmof,dell_smbios,dell_wmi_descriptor
ipmi_si 69632 1
acpi_power_meter 20480 0
binfmt_misc 20480 1
ipmi_devintf 20480 0
ipmi_msghandler 110592 4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
lpfc 1179648 0
nvmet_fc 40960 1 lpfc
nvmet 110592 1 nvmet_fc
nvme_fc 53248 1 lpfc
nvme_fabrics 24576 1 nvme_fc
iavf 151552 0
ixgbevf 77824 0
mlx4_en 135168 0
mlx4_core 364544 1 mlx4_en
qedr 126976 0
qede 184320 1 qedr
qed 778240 2 qede,qedr
crc8 16384 1 qed
hpilo 20480 0
sr_mod 28672 0
xts 16384 0
dm_crypt 49152 0
bnxt_en 286720 0
mpt3sas 335872 0
raid_class 16384 1 mpt3sas
usb_storage 73728 0
squashfs 65536 0
loop 40960 0
isofs 49152 0
smartpqi 98304 0
dm_thin_pool 86016 0
dm_bio_prison 20480 1 dm_thin_pool
dm_persistent_data 94208 1 dm_thin_pool
dm_bufio 32768 1 dm_persistent_data
dm_mod 151552 3 dm_crypt,dm_thin_pool,dm_bufio
udf 102400 0
crc_itu_t 16384 1 udf
cdrom 65536 3 udf,isofs,sr_mod
scsi_transport_fc 81920 1 lpfc
vfat 20480 1
fat 81920 1 vfat
br_netfilter 24576 0
bridge 278528 1 br_netfilter
stp 16384 1 bridge
llc 16384 2 bridge,stp
xfs 1556480 1
qla3xxx 49152 0
hpsa 102400 0
e1000e 286720 0
ixgbe 376832 0
igb 253952 0
i2c_algo_bit 16384 2 igb,mgag200
dca 16384 2 igb,ixgbe
megaraid_sas 176128 0
aacraid 139264 0
ata_piix 36864 0
sd_mod 53248 0
mptspi 28672 0
scsi_transport_spi 40960 1 mptspi
mptsas 69632 0
mptscsih 45056 2 mptsas,mptspi
mptbase 98304 3 mptsas,mptspi,mptscsih
scsi_transport_sas 45056 4 mptsas,hpsa,smartpqi,mpt3sas
bnx2x 876544 0
mdio 16384 2 bnx2x,ixgbe
libcrc32c 16384 6 nf_conntrack,nf_nat,dm_persistent_data,bnx2x,nf_tables,xfs
bnx2 94208 0
ext4 761856 0
mbcache 16384 1 ext4
jbd2 131072 1 ext4
e1000 151552 0
nfsv4 835584 0
dns_resolver 16384 1 nfsv4
nfsv3 53248 1
nfs_acl 16384 1 nfsv3
nfs 385024 4 nfsv4,nfsv3
lockd 122880 2 nfsv3,nfs
grace 16384 1 lockd
sunrpc 565248 23 lnet,rpcrdma,nfsv4,lockd,nfsv3,nfs_acl,nfs
fscache 385024 1 nfs
tun 49152 0
irdma 356352 0
ice 765952 1 irdma
rdma_ucm 32768 0
ib_srpt 69632 0
ib_isert 57344 0
iscsi_target_mod 356352 1 ib_isert
target_core_mod 417792 3 iscsi_target_mod,ib_srpt,ib_isert
ib_iser 49152 0
libiscsi 61440 1 ib_iser
scsi_transport_iscsi 131072 2 ib_iser,libiscsi
ib_umad 28672 0
rdma_cm 114688 5 rpcrdma,ib_srpt,ib_iser,ib_isert,rdma_ucm
ib_ipoib 147456 0
iw_cm 53248 1 rdma_cm
ib_cm 114688 3 rdma_cm,ib_ipoib,ib_srpt
mlx5_ib 389120 0
ib_uverbs 163840 4 irdma,rdma_ucm,mlx5_ib,qedr
ib_core 393216 14 rdma_cm,ib_ipoib,rpcrdma,ib_srpt,iw_cm,ib_iser,ib_umad,ib_isert,irdma,rdma_ucm,ib_uverbs,mlx5_ib,qedr,ib_cm
sg 40960 0
mlx5_core 1572864 1 mlx5_ib
crct10dif_pclmul 16384 1
crc32c_intel 24576 1
pci_hyperv_intf 16384 1 mlx5_core
i40e 491520 1 irdma
ahci 40960 0
psample 20480 1 mlx5_core
ghash_clmulni_intel 16384 0
nvme 45056 3
libahci 40960 1 ahci
mlxfw 28672 1 mlx5_core
tls 102400 1 mlx5_core
libata 262144 3 ata_piix,libahci,ahci
nvme_core 114688 7 nvme,nvme_fc,nvme_fabrics
tg3 188416 0
t10_pi 16384 3 nvmet,sd_mod,nvme_core
fuse 155648 1
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# modinfo ib_uverbs
filename: /lib/modules/4.18.0-372.9.1.el8.x86_64/kernel/drivers/infiniband/core/ib_uverbs.ko.xz
alias: rdma-client-uverbs
license: Dual BSD/GPL
description: InfiniBand userspace verbs access
author: Roland Dreier
rhelversion: 8.6
srcversion: F485E52CF6F50429494777A
depends: ib_core
intree: Y
name: ib_uverbs
vermagic: 4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
sig_id: PKCS#7
signer: Rocky kernel signing key
sig_key: 24:62:83:5E:57:6D:46:8C:7B:45:DD:87:7C:69:5A:C6:BC:46:85:94
sig_hashalgo: sha256
signature: 3B:5D:F8:D7:4E:50:C2:51:0E:AB:BD:C8:26:B9:7E:DB:F8:41:15:F3:
83:06:82:74:BE:CC:D7:55:CC:C9:52:93:67:F8:6E:7D:44:09:FC:45:
4F:8E:30:49:42:A1:6B:6D:B8:8C:D5:D9:B0:E8:2B:9B:B8:F2:AB:BA:
61:72:A9:56:1C:B5:2C:CB:86:31:64:7E:3D:4F:ED:78:49:CA:5D:FD:
5F:AB:0C:E2:5B:45:A0:40:7A:E8:5B:7C:6A:EE:F3:18:CC:E5:38:58:
94:C7:90:B1:66:64:63:25:57:0C:85:B8:F6:FD:60:B0:70:90:67:3A:
9F:8F:62:7F:A6:A8:E1:50:57:4A:5C:43:E8:9C:6C:B1:91:46:4F:64:
61:91:BE:C9:DD:48:07:62:70:A1:90:81:00:DD:50:11:CC:D8:F5:F4:
B5:86:79:82:FD:78:49:65:77:05:85:4F:A5:F3:F5:D6:54:E8:CD:A7:
DA:F5:6E:0F:32:F1:B3:BE:09:52:1B:33:18:BC:0A:56:1D:73:10:66:
E7:6A:6F:A7:A6:08:28:64:D4:3E:EB:66:64:C0:C1:3D:E7:16:1A:38:
A3:D5:3B:4E:0F:05:83:A1:1E:95:44:20:D9:19:C7:5D:9C:CA:E8:3E:
F5:C9:6F:5E:88:6B:50:0B:8B:B0:EF:6C:E2:5F:61:39:91:32:E0:C6:
67:92:C9:9F:8F:5E:2D:E9:9C:D7:07:7B:7E:AF:AC:3F:FB:72:B3:2D:
37:93:FC:24:0C:55:4F:28:53:D5:5D:66:AB:2F:E1:CF:A7:EE:6C:C3:
71:4A:9D:1B:85:4F:62:DA:12:FF:D1:87:F5:4C:48:2B:F1:5D:9F:24:
50:A4:BA:1D:6B:99:77:61:9B:65:39:9F:56:51:5F:65:C4:4F:3E:5D:
A0:91:93:E0:5E:7C:73:95:D8:C8:B1:E2:D9:BE:F5:0E:D5:82:64:D8:
01:C6:49:0D:1F:C0:CD:DC:5B:99:43:86:95:05:B6:8A:30:44:57:4E:
C9:75:FC:09
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# modinfo ib_umad
filename: /lib/modules/4.18.0-372.9.1.el8.x86_64/kernel/drivers/infiniband/core/ib_umad.ko.xz
alias: rdma-client-issm
alias: rdma-client-umad
license: Dual BSD/GPL
description: InfiniBand userspace MAD packet access
author: Roland Dreier
rhelversion: 8.6
srcversion: EEA36F7782E21E939DF90E0
depends: ib_core
intree: Y
name: ib_umad
vermagic: 4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
sig_id: PKCS#7
signer: Rocky kernel signing key
sig_key: 24:62:83:5E:57:6D:46:8C:7B:45:DD:87:7C:69:5A:C6:BC:46:85:94
sig_hashalgo: sha256
signature: 1D:FC:C2:92:9D:C7:32:66:5A:09:CD:64:64:96:A5:12:4A:4B:84:F6:
4C:0E:12:B0:61:F4:55:49:D3:05:79:02:90:F3:AF:40:0D:4A:96:62:
30:7B:D5:42:C9:9F:6C:CD:9C:EF:D9:D5:B9:B4:FC:73:C3:3E:25:9C:
07:0E:C8:90:CA:72:08:A7:67:93:1F:EB:ED:89:B9:AA:16:17:91:CE:
1E:18:D6:80:C1:CA:03:8F:04:C8:03:AC:49:B0:D6:4E:EA:F4:2D:6E:
9E:9D:83:F2:33:EF:6B:AF:D3:EA:6E:8B:47:9C:5A:29:11:B9:3F:CF:
16:88:55:6F:38:0E:95:01:38:75:EE:81:15:2E:8F:F5:A1:F2:1D:33:
04:49:0A:E9:DE:3C:D5:27:17:AE:12:96:0A:DE:9E:DB:CD:3B:0D:E6:
22:9F:26:CB:44:C2:56:9D:06:27:9E:F4:A5:AC:D9:8D:A8:B4:3B:94:
23:74:02:F2:55:75:B8:65:AD:8A:F7:B7:8B:9C:BD:7E:B0:D6:CF:C9:
33:08:F2:5A:91:DC:36:57:72:21:1D:E0:E1:EF:F5:C4:4B:FA:C3:4C:
95:D3:8C:8D:50:3F:CC:B8:0F:0A:84:7E:F3:C2:8E:9D:EE:F9:D8:B5:
19:9B:65:42:D0:37:77:10:3B:CF:D6:92:85:BD:D0:55:A1:2C:6D:2F:
FD:8E:17:87:C4:4B:E2:D7:12:9C:73:B0:A1:63:9B:FE:2E:2D:FC:94:
E8:E2:0C:CA:F2:1D:EC:27:E1:D5:9B:00:F1:08:53:8B:A3:92:F1:10:
30:D2:91:F6:5F:F0:B6:C2:2A:82:86:D9:ED:20:BB:9B:BF:EF:4C:4A:
A2:9B:DB:CF:E9:64:5D:7D:E8:0D:A6:22:25:B3:1A:F8:F5:63:E0:D4:
7B:96:9E:AF:24:38:54:56:35:53:C3:AC:49:C0:CD:D5:33:8A:56:7D:
D7:C0:46:6F:9A:97:A3:F2:7E:14:3C:9A:6A:6D:36:EF:D5:F2:4A:10:
E2:02:52:AC
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# modinfo mlx5_ib
filename: /lib/modules/4.18.0-372.9.1.el8.x86_64/kernel/drivers/infiniband/hw/mlx5/mlx5_ib.ko.xz
license: Dual BSD/GPL
description: Mellanox 5th generation network adapters (ConnectX series) IB driver
author: Eli Cohen <eli@mellanox.com>
rhelversion: 8.6
srcversion: D733C181AA9D6B40A8CBDD4
alias: auxiliary:mlx5_core.rdma
alias: auxiliary:mlx5_core.multiport
alias: auxiliary:mlx5_core.rdma-rep
depends: mlx5_core,ib_core,ib_uverbs
intree: Y
name: mlx5_ib
vermagic: 4.18.0-372.9.1.el8.x86_64 SMP mod_unload modversions
sig_id: PKCS#7
signer: Rocky kernel signing key
sig_key: 24:62:83:5E:57:6D:46:8C:7B:45:DD:87:7C:69:5A:C6:BC:46:85:94
sig_hashalgo: sha256
signature: 4A:A4:31:5C:5E:15:11:F8:29:44:2D:BA:41:1B:1E:5E:0D:B2:E4:2A:
72:9C:7C:F5:A2:5E:09:41:85:CF:4E:91:6D:1D:21:7F:3B:1D:B6:F7:
B0:F4:F3:CA:9D:51:9C:60:96:47:11:F3:DB:52:0E:C0:AF:21:40:5F:
3D:C9:48:29:2B:3A:FE:84:A6:92:4B:52:57:AA:A0:4C:D7:FE:29:D1:
74:6B:F8:67:0F:6F:52:3C:DD:0F:69:7B:D0:F5:13:14:22:F8:23:F2:
A1:78:CE:A3:4F:88:FC:8D:D6:A4:0D:A8:6B:82:13:AC:E7:3E:E3:B6:
A2:4E:B7:64:97:CA:03:32:AB:FF:0E:4D:08:2B:4C:F1:88:93:6F:97:
D2:D5:74:79:77:77:E1:15:71:06:AC:7C:AB:97:23:04:16:E4:59:A5:
14:01:2D:CF:F1:EF:3D:29:9B:9A:FB:43:01:BE:9F:34:89:2E:92:30:
87:6C:0F:04:9E:88:A2:EC:D1:E5:76:9A:A0:12:62:B3:86:30:CF:0A:
99:57:6C:98:29:F0:43:47:87:47:F3:0F:E7:F5:15:A0:D0:3D:98:83:
36:71:32:D0:BF:60:A4:B0:3D:1A:24:AF:9C:CC:12:67:10:6A:47:62:
08:A8:A5:72:1F:AF:46:D9:56:F0:D0:2C:D6:C5:C5:D7:CF:44:54:F2:
A5:49:F4:E8:76:B4:F1:82:0B:C8:7C:99:38:4C:86:DB:60:F6:7E:0B:
D8:4D:19:B8:D1:BE:20:F2:22:5F:B8:DE:7E:FE:18:D9:A0:35:E3:B6:
18:33:7E:C6:DC:C8:3A:5E:16:7F:14:61:FA:65:FE:E3:51:98:01:B1:
99:49:81:69:A0:4B:32:64:9F:6B:F8:5F:4A:8A:50:E9:15:D7:A6:FB:
D3:06:3D:EE:94:69:2A:9A:D4:9A:61:67:F9:8D:42:2D:44:A1:EE:B8:
D6:1B:75:FE:EF:85:35:7B:00:A9:F9:04:81:68:99:6A:FD:71:51:27:
06:ED:8A:2B
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# ibstat
CA 'mlx5_0'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.1014
Hardware version: 0
Node GUID: 0x043f720300dc0684
System image GUID: 0x043f720300dc0684
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x063f72fffedc0684
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4119
Number of ports: 1
Firmware version: 16.31.1014
Hardware version: 0
Node GUID: 0x043f720300dc0685
System image GUID: 0x043f720300dc0684
Port 1:
State: Down
Physical state: Disabled
Rate: 40
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x063f72fffedc0685
Link layer: Ethernet
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]#
[root@node001 ~]# ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.31.1014
node_guid: 043f:7203:00dc:0684
sys_image_guid: 043f:7203:00dc:0684
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 16.31.1014
node_guid: 043f:7203:00dc:0685
sys_image_guid: 043f:7203:00dc:0684
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
@karanveersingh5623
Could you please create a file called 50-mellanox.env containing the line MELLANOX_VISIBLE_DEVICES=all, put it in the /etc/enroot/environ.d/ directory on node002 and on every other compute node you are using, and then rerun:
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
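A minimal sketch of the file-creation step, to be run as root on each compute node. The function name `write_mellanox_env` and its optional directory argument are hypothetical; the file name, its content, and the default directory are the ones given above.

```shell
# Hypothetical helper: write the enroot environment file described above.
# The optional first argument overrides the target directory, which is
# handy for trying the function out without touching /etc.
write_mellanox_env() {
    env_dir="${1:-/etc/enroot/environ.d}"
    mkdir -p "$env_dir" || return 1
    printf 'MELLANOX_VISIBLE_DEVICES=all\n' > "$env_dir/50-mellanox.env"
}
```

Calling `write_mellanox_env` with no argument writes `/etc/enroot/environ.d/50-mellanox.env`; the `srun ... ibv_devinfo` command above can then be used to check whether the devices are visible inside the container.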
@Artemy-Mellanox
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 1 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area ibv_devinfo
pyxis: importing docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.31.1014
node_guid: 043f:7203:00dc:0684
sys_image_guid: 043f:7203:00dc:0684
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
hca_id: mlx5_1
transport: InfiniBand (0)
fw_ver: 16.31.1014
node_guid: 043f:7203:00dc:0685
sys_image_guid: 043f:7203:00dc:0684
vendor_id: 0x02c9
vendor_part_id: 4119
hw_ver: 0x0
board_id: MT_0000000012
phys_port_cnt: 1
port: 1
state: PORT_DOWN (1)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
That was probably the missing part.
Could you please rerun the original run_and_time.sh scenario?
@Artemy-Mellanox Thanks, it worked :) But it is taking a very long time: after 2.5 hours it is still at epoch 1 of 5. When I run on a single node with multiple GPUs, the same task finishes in 15~20 min.
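One rough way to see where the time goes is to compare the `time_ms` fields of consecutive `:::MLLOG` lines in the job output. In the output posted here, `init_start` carries `time_ms` 1669095119176 and the next group of entries starts at 1669097805467, about 45 minutes later. The sketch below prints such gaps in seconds; the function name `mllog_gaps` is made up, and it assumes the job output was saved to a file and that each MLLOG line has `time_ms` before `key`, as in the lines shown here.

```shell
# Hypothetical helper: print the time elapsed between consecutive :::MLLOG
# entries of a saved job log, in whole seconds.
mllog_gaps() {
    # Reduce every MLLOG line to "<time_ms> <key>"; other lines are dropped.
    sed -n 's/.*:::MLLOG .*"time_ms": \([0-9][0-9]*\).*"key": "\([^"]*\)".*/\1 \2/p' "$1" |
    {
        prev=""
        while read -r ts key; do
            if [ -n "$prev" ]; then
                echo "$(( (ts - prev) / 1000 )) s before $key"
            fi
            prev="$ts"
        done
    }
}
```

For example, `mllog_gaps slurm-149.out` (the file name is only a placeholder) lists one line per MLLOG entry after the first, so unusually large gaps stand out.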
[root@bright88 mxnet]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
149 defq bash root R 2:38:18 2 node[001-002]
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,OMPI_MCA_coll=^hcoll" --kill-on-bad-exit=0 --mpi=pmix_v3 -N 2 --gpus-per-task=1 --gpu-bind=none --ntasks-per-node=1 --exclusive --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-22 02:31:51 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=32
+ exec numactl --physcpubind=0-31,64-95 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
STARTING TIMING RUN AT 2022-11-22 02:31:57 PM
running benchmark
dpkg: warning: version '4.18.0-372.9.1.el8.x86_64' has bad syntax: invalid character in revision number
num_sockets = 2 num_nodes=2 cores_per_socket=22
+ exec numactl --physcpubind=0-21,44-65 --membind=0 -- python train.py --log-prefix 'run__{}_.log' --data-root-dir /data --num-epochs 5 --target-mae 0.124 --base-lr 0.001 --initial-lr 0.001 --momentum 0.9 --weight-decay 0.0 --warmup-epochs 1 --lr-scheduler-epochs 32 64 --lr-scheduler-decays 0.25 0.125 --training-batch-size 16 --validation-batch-size 16 --training-samples -1 --validation-samples -1 --data-layout NDHWC --data-shard-multiplier 1 --dali-num-threads 64 --shard-type local --seed 0 --grad-prediv-factor 1.0 --instances 1 --spatial-span 1 --load-checkpoint '' --save-checkpoint /results/checkpoint.data --apply-log-transform --shuffle --preshuffle
[14:31:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
Namespace(apply_log_transform=True, base_lr=0.001, config_file=None, cuda_profiler_range='', dali_num_threads=64, dali_use_mmap=False, data_layout='NDHWC', data_root_dir=PosixPath('/data'), data_shard_multiplier=1, dropout=0.5, grad_prediv_factor=1.0, initial_lr=0.001, instances=1, load_checkpoint='', log_prefix='run__{}_.log', lr_scheduler_decays=[0.25, 0.125], lr_scheduler_epochs=[32, 64], momentum=0.9, num_epochs=5, preshuffle=True, prestage=False, profile=False, save_checkpoint='/results/checkpoint.data', seed=0, shard_type='local', shuffle=True, spatial_span=1, static_loss_scale=16384, target_mae=0.124, training_batch_size=16, training_samples=-1, use_amp=False, use_fp16=False, validation_batch_size=16, validation_samples=-1, warmup_epochs=1, weight_decay=0.0)
:::MLLOG {"namespace": "", "time_ms": 1669095119147, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": null, "metadata": {"file": "train.py", "lineno": 134}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "INTERVAL_START", "key": "init_start", "value": null, "metadata": {"file": "train.py", "lineno": 135}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_benchmark", "value": "cosmoflow", "metadata": {"file": "train.py", "lineno": 137}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_org", "value": "NVIDIA", "metadata": {"file": "train.py", "lineno": 139}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_division", "value": "closed", "metadata": {"file": "train.py", "lineno": 140}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_status", "value": "onprem", "metadata": {"file": "train.py", "lineno": 141}}
:::MLLOG {"namespace": "", "time_ms": 1669095119176, "event_type": "POINT_IN_TIME", "key": "submission_platform", "value": "2xNVIDIA DGX A100", "metadata": {"file": "train.py", "lineno": 142}}
:::MLLOG {"namespace": "", "time_ms": 1669095119177, "event_type": "POINT_IN_TIME", "key": "number_of_nodes", "value": 2, "metadata": {"file": "train.py", "lineno": 145}}
:::MLLOG {"namespace": "", "time_ms": 1669095119177, "event_type": "POINT_IN_TIME", "key": "accelerators_per_node", "value": 1, "metadata": {"file": "train.py", "lineno": 146}}
[14:31:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
[14:32:00] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[14:32:00] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for GPU
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for dgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 32 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for dgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 64 num_group: 1 workspace: 1024
[14:32:05] ../src/operator/cudnn_ops.cc:441: Using fallback engine(s) for wgrad float NDHWC kernel: [3,3,3] stride: [1,1,1] dilate: [1,1,1] pad: [1,1,1] num_filter: 32 num_group: 1 workspace: 1024
node001:3316259:3316286 [0] NCCL INFO Bootstrap : Using eth3:192.168.61.89<0>
node001:3316259:3316286 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
node001:3316259:3316286 [0] NCCL INFO P2P plugin IBext
node001:3316259:3316286 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth3:192.168.61.89<0>
node001:3316259:3316286 [0] NCCL INFO Using network IBext
NCCL version 2.11.4+cuda11.4
node002:2000130:2000157 [0] NCCL INFO Bootstrap : Using eth4:192.168.61.90<0>
node002:2000130:2000157 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
node002:2000130:2000157 [0] NCCL INFO P2P plugin IBext
node002:2000130:2000157 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE ; OOB eth4:192.168.61.90<0>
node002:2000130:2000157 [0] NCCL INFO Using network IBext
node001:3316259:3316286 [0] NCCL INFO Channel 00/02 : 0 1
node001:3316259:3316286 [0] NCCL INFO Channel 01/02 : 0 1
node001:3316259:3316286 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1
node001:3316259:3316286 [0] NCCL INFO Setting affinity for GPU 0 to 55555555,00000000,55555555
node002:2000130:2000157 [0] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1
node002:2000130:2000157 [0] NCCL INFO Setting affinity for GPU 0 to 02,aaaaa000,002aaaaa
node001:3316259:3316286 [0] NCCL INFO Channel 00 : 1[af000] -> 0[17000] [receive] via NET/IBext/0
node001:3316259:3316286 [0] NCCL INFO Channel 01 : 1[af000] -> 0[17000] [receive] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 00 : 0[17000] -> 1[af000] [receive] via NET/IBext/0
node001:3316259:3316286 [0] NCCL INFO Channel 00 : 0[17000] -> 1[af000] [send] via NET/IBext/0
node001:3316259:3316286 [0] NCCL INFO Channel 01 : 0[17000] -> 1[af000] [send] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 01 : 0[17000] -> 1[af000] [receive] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 00 : 1[af000] -> 0[17000] [send] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Channel 01 : 1[af000] -> 0[17000] [send] via NET/IBext/0
node002:2000130:2000157 [0] NCCL INFO Connected all rings
node002:2000130:2000157 [0] NCCL INFO Connected all trees
node002:2000130:2000157 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
node002:2000130:2000157 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
node002:2000130:2000157 [0] NCCL INFO comm 0x15540845f760 rank 1 nranks 2 cudaDev 0 busId af000 - Init COMPLETE
node001:3316259:3316286 [0] NCCL INFO Connected all rings
node001:3316259:3316286 [0] NCCL INFO Connected all trees
node001:3316259:3316286 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/512
node001:3316259:3316286 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
node001:3316259:3316286 [0] NCCL INFO comm 0x155408460740 rank 0 nranks 2 cudaDev 0 busId 17000 - Init COMPLETE
node001:3316259:3316286 [0] NCCL INFO Launch mode Parallel
:::MLLOG {"namespace": "", "time_ms": 1669097805467, "event_type": "POINT_IN_TIME", "key": "opt_weight_decay", "value": 0.0, "metadata": {"file": "train.py", "lineno": 165}}
:::MLLOG {"namespace": "", "time_ms": 1669097805467, "event_type": "POINT_IN_TIME", "key": "dropout", "value": 0.5, "metadata": {"file": "train.py", "lineno": 167}}
:::MLLOG {"namespace": "", "time_ms": 1669097805555, "event_type": "POINT_IN_TIME", "key": "global_batch_size", "value": 32, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 352}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "train_samples", "value": 32768, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 354}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "eval_samples", "value": 16384, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 355}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_base_learning_rate", "value": 0.001, "metadata": {"file": "train.py", "lineno": 92}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_epochs", "value": 1, "metadata": {"file": "train.py", "lineno": 94}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_warmup_factor", "value": 1, "metadata": {"file": "train.py", "lineno": 96}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_boundary_epochs", "value": [32, 64], "metadata": {"file": "train.py", "lineno": 98}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_learning_rate_decay_factor", "value": [0.25, 0.125], "metadata": {"file": "train.py", "lineno": 100}}
:::MLLOG {"namespace": "", "time_ms": 1669097805556, "event_type": "POINT_IN_TIME", "key": "opt_name", "value": "sgd", "metadata": {"file": "train.py", "lineno": 184}}
:::MLLOG {"namespace": "", "time_ms": 1669097805566, "event_type": "INTERVAL_END", "key": "init_stop", "value": null, "metadata": {"file": "/workspace/cosmoflow/utils.py", "lineno": 144}}
:::MLLOG {"namespace": "", "time_ms": 1669097805566, "event_type": "INTERVAL_START", "key": "run_start", "value": null, "metadata": {"file": "train.py", "lineno": 206}}
:::MLLOG {"namespace": "", "time_ms": 1669097805566, "event_type": "INTERVAL_START", "key": "staging_start", "value": null, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 359}}
:::MLLOG {"namespace": "", "time_ms": 1669097807786, "event_type": "INTERVAL_END", "key": "staging_stop", "value": null, "metadata": {"file": "/workspace/cosmoflow/data.py", "lineno": 362, "staging_duration": 2.219392776489258}}
:::MLLOG {"namespace": "", "time_ms": 1669097807786, "event_type": "INTERVAL_START", "key": "epoch_start", "value": null, "metadata": {"file": "train.py", "lineno": 215, "epoch_num": 1}}
@karanveersingh5623 can we narrow down this issue by running some basic RDMA performance tests on this setup?
@Artemy-Mellanox, let's do it :)
After 19 hours, the job is still at epoch 3.
[root@bright88 mxnet]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
149 defq bash root R 19:51:40 2 node[001-002]
@Artemy-Mellanox How can I verify RDMA? I don't think I have installed any MLNX_OFED driver on the hosts. I am using ConnectX-5 with TCP, and I have not installed the packages below. The hosts run Rocky Linux 8.6.
# yum -y groupinstall "InfiniBand Support"
# yum -y install perftest infiniband-diags
It is recommended to install the latest MLNX_OFED, however, it is possible to use the RDMA inbox drivers.
RDMA/RoCE with a ConnectX-5 GbE card is possible, but I guess I am just using TCP for communication. Please correct me if I am wrong and let me know the next steps.
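One way to check whether the hosts expose an RDMA device at all, before installing extra packages, is to read sysfs. The sketch below assumes the inbox mlx5 driver is loaded and that devices show up under /sys/class/infiniband; the function name list_rdma_ports is made up for illustration.

```bash
#!/usr/bin/env bash
# List RDMA devices with per-port link layer and state, read from sysfs.
# The sysfs root is a parameter so the function can also be pointed at a copy.
list_rdma_ports() {
    local root="${1:-/sys/class/infiniband}"
    local port dev
    for port in "$root"/*/ports/*; do
        [ -d "$port" ] || continue      # glob did not match: no RDMA devices
        dev="${port#"$root"/}"          # e.g. mlx5_0/ports/1
        dev="${dev%%/*}"                # e.g. mlx5_0
        printf '%s port %s: link_layer=%s state=%s\n' \
            "$dev" "${port##*/}" \
            "$(cat "$port/link_layer" 2>/dev/null)" \
            "$(cat "$port/state" 2>/dev/null)"
    done
}

list_rdma_ports "$@"
```

A port that reports link_layer=Ethernet in an active state suggests RoCE is available, which would be consistent with the "NET/IB : Using [0]mlx5_0:1/RoCE" line in the NCCL log above.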
@karanveersingh5623 could you please install the perftest package to test RDMA between the nodes?
Try adding this to your script:
sleep $SLURM_NODEID && ib_send_bw $([ $SLURM_NODEID == 0 ] || echo node001)
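To unpack the one-liner: on node 0 the test [ $SLURM_NODEID == 0 ] succeeds, so the echo never runs and ib_send_bw starts without a peer argument, that is, as the server. On any other node the test fails, echo prints node001, and ib_send_bw connects to it as a client after the sleep. The sketch below prints the equivalent command for a given node ID; perftest_cmd is a hypothetical helper, and node001 as the server name is taken from the suggestion above.

```bash
#!/usr/bin/env bash
# Print the perftest command a node would run for a given Slurm node ID.
# Node 0 acts as the server (no peer argument); every other node waits
# a little so the server is listening, then connects as a client.
perftest_cmd() {
    local node_id="$1" server="${2:-node001}"
    if [ "$node_id" -eq 0 ]; then
        echo "ib_send_bw"
    else
        echo "sleep $node_id && ib_send_bw $server"
    fi
}

perftest_cmd 0   # server side
perftest_cmd 1   # client side
```

Any extra option such as -b would have to be given on both sides, otherwise the two ends disagree about the test being run.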
@Artemy-Mellanox
Server
[root@node001 perftest]# ./ib_send_bw
************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
Send BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
RX depth : 512
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x4f24 PSN 0x3c89e0
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:89
remote address: LID 0000 QPN 0x1de0 PSN 0x67f45c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:90
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 0.00 6284.84 0.100557
---------------------------------------------------------------------------------------
ethernet_read_keys: Couldn't read remote address
Unable to read to socket/rdma_cm
Failed to close connection between server and client
Trying to close this side resources
Client
[root@node002 perftest]# ./ib_send_bw -b node001
---------------------------------------------------------------------------------------
Send Bidirectional BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
RX depth : 512
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x1de0 PSN 0x67f45c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:90
remote address: LID 0000 QPN 0x4f24 PSN 0x3c89e0
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:89
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
Did not get Message for 120 Seconds, exiting..
Total Received=0, Total Iters Required=1000
ib_send_bw shows good performance (the -b option is needed on the server side too, so that there is no error).
What is the expected time to finish the cosmoflow benchmark with this dataset?
@Artemy-Mellanox I ran with the -b option on the server side as well, and I am getting about 13 GB/s. The test is bidirectional, so the throughput per direction is around 6.5 GB/s, or roughly 50 Gbps. That is expected, because my Mellanox card sits in an x8 PCIe slot rather than x16; otherwise I would have got 100 Gbps.
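The estimate above can be checked with a quick conversion of the two perftest averages in this thread (6284.84 MB/sec unidirectional, 13187.32 MB/sec bidirectional). This is a rough sketch that assumes 1 MB means 10^6 bytes; if the tool reports units of 2^20 bytes, the results are about 5% higher. The helper name to_gbps is made up for illustration.

```bash
#!/usr/bin/env bash
# Convert a perftest average in MB/sec into Gb/s per direction.
#   $1 = bandwidth in MB/sec (assumed 10^6 bytes per MB)
#   $2 = number of directions summed in that figure (1 or 2)
to_gbps() {
    awk -v mb="$1" -v dirs="${2:-1}" \
        'BEGIN { printf "%.1f\n", mb / dirs * 8 / 1000 }'
}

to_gbps 6284.84 1     # unidirectional run
to_gbps 13187.32 2    # bidirectional run, per direction
```

Both runs land a little above 50 Gb/s per direction, so the two measurements agree with each other.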
[root@node004 perftest]# ./ib_send_bw -b node003
---------------------------------------------------------------------------------------
Send Bidirectional BW Test
Dual-port : OFF Device : mlx5_0
Number of qps : 1 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON
ibv_wr* API : ON
TX depth : 128
RX depth : 512
CQ Moderation : 1
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : OFF
Data ex. method : Ethernet
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x463b PSN 0x645ee4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:92
remote address: LID 0000 QPN 0x4d63 PSN 0x5a236c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:61:91
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[MB/sec] BW average[MB/sec] MsgRate[Mpps]
65536 1000 13188.14 13187.32 0.210997
---------------------------------------------------------------------------------------
ib_send_bw shows good performance (the -b option is needed on the server side too, so that there is no error).
What is the expected time to finish the cosmoflow benchmark with this dataset?
As I am running only 5 epochs, the maximum time to finish on a single node (multi-GPU) is 20~25 min.
@karanveersingh5623 could you please download the OSU benchmark, build it with CUDA:
./configure CC=$OMPI_HOME/bin/mpicc CXX=$OMPI_HOME/bin/mpic++ --enable-cuda --with-cuda=$CUDA_HOME
and run the mpi/pt2pt/osu_bw D D test between two nodes using mpirun?
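Put together, the request above amounts to three steps: configure, build, and a two-rank run with one rank per node. The sketch below only prints those steps so they can be reviewed before running; osu_steps is a hypothetical helper, the paths and host names are placeholders, and the -np and -H options are the usual Open MPI mpirun options for rank count and host list.

```bash
#!/usr/bin/env bash
# Print the configure, build, and run steps for the OSU device-to-device
# bandwidth test. Nothing is executed; the output is meant to be reviewed
# and then pasted or piped to a shell.
osu_steps() {
    local ompi_home="$1" cuda_home="$2" host_a="$3" host_b="$4"
    cat <<EOF
./configure CC=$ompi_home/bin/mpicc CXX=$ompi_home/bin/mpic++ --enable-cuda --with-cuda=$cuda_home
make
$ompi_home/bin/mpirun -np 2 -H $host_a,$host_b ./mpi/pt2pt/osu_bw D D
EOF
}

osu_steps /usr/lib64/openmpi /cm/shared/apps/cuda11.7/toolkit/11.7.1 node001 node002
```

The D D arguments place both the send and receive buffers in GPU memory, so the run is only meaningful when the MPI library itself was built with CUDA support.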
@Artemy-Mellanox, configure is failing for the OSU benchmark. Please see the trace below and let me know any pointers.
[root@node001 OSU_Microbenchmarks]# nvidia-smi
Tue Dec 13 16:55:17 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... On | 00000000:17:00.0 Off | 0 |
| N/A 33C P0 41W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100 80G... On | 00000000:65:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100 80G... On | 00000000:CA:00.0 Off | 0 |
| N/A 34C P0 45W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100 80G... On | 00000000:E3:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# ls /cm/shared/apps/cuda11.7/toolkit/11.7.1/
bin C compat compute-sanitizer CUDA_Toolkit_Release_Notes.txt DOCS etc EULA.txt extras gds include lib64 LICENSE man nvml nvvm README share src targets tools version.json
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# ls /usr/lib64/openmpi/bin/
aggregate_profile.pl mpicc mpicxx mpif77 mpifort ompi-clean ompi-server ortecc orted orterun oshc++ oshCC oshfort oshrun shmemc++ shmemCC shmemfort
mpic++ mpiCC mpiexec mpif90 mpirun ompi_info opal_wrapper orte-clean orte-info orte-server oshcc oshcxx oshmem_info profile2mat.pl shmemcc shmemcxx shmemrun
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# ./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... /usr/lib64/openmpi/bin/mpicc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by /usr/lib64/openmpi/bin/mpicc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 3458764513820540925
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from /usr/lib64/openmpi/bin/mpicc object... ok
checking how to run the C preprocessor... /usr/lib64/openmpi/bin/mpicc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if /usr/lib64/openmpi/bin/mpicc supports -fno-rtti -fno-exceptions... no
checking for /usr/lib64/openmpi/bin/mpicc option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpicc PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpicc static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpicc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether we are using the GNU C compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... (cached) yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... (cached) none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... (cached) gcc3
checking whether we are using the GNU C++ compiler... yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... gcc3
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... (cached) yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... (cached) gcc3
checking how to run the C++ preprocessor... /usr/lib64/openmpi/bin/mpic++ -E
checking for ld used by /usr/lib64/openmpi/bin/mpic++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for /usr/lib64/openmpi/bin/mpic++ option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpic++ PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpic++ static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for library containing sqrt... -lm
checking for library containing pthread_join... none required
checking for library containing clock_gettime... none required
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for inline... inline
checking for getpagesize... yes
checking for gettimeofday... yes
checking for memset... yes
checking for sqrt... yes
checking for MPI_Init... yes
checking for MPI_Accumulate... yes
checking for MPI_Get_accumulate... yes
checking for shmem_barrier_all... no
checking for upc_memput... no
checking whether upcxx_alltoall is declared... no
checking for library containing cuPointerGetAttribute... no
configure: error: cannot link with -lcuda
[root@node001 OSU_Microbenchmarks]#
configure failed to link with CUDA. Could you please attach config.log so we can identify the reason?
@Artemy-Mellanox, please find config.log attached.
You need to either install the nvidia-driver-latest-cuda-libs package or add /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs to LD_LIBRARY_PATH.
@Artemy-Mellanox, same result!
[root@node001 OSU_Microbenchmarks]# ll
total 1448
-rw-r--r-- 1 root root 316266 Dec 13 16:28 aclocal.m4
-rw-r--r-- 1 root root 9579 Dec 13 16:28 CHANGES
-rwxr-xr-x 1 root root 44941 Dec 13 16:28 config.guess
-rw-r--r-- 1 root root 51810 Dec 26 10:30 config.log
-rwxr-xr-x 1 root root 34423 Dec 13 16:28 config.sub
-rwxr-xr-x 1 root root 607857 Dec 13 16:28 configure
-rw-r--r-- 1 root root 6275 Dec 13 16:28 configure.ac
-rw-r--r-- 1 root root 2024 Dec 13 16:28 COPYRIGHT
-rwxr-xr-x 1 root root 18615 Dec 13 16:28 depcomp
-rwxr-xr-x 1 root root 66 Dec 13 16:28 get_local_rank
-rwxr-xr-x 1 root root 13663 Dec 13 16:28 install-sh
-rwxr-xr-x 1 root root 243248 Dec 13 16:28 ltmain.sh
-rw-r--r-- 1 root root 252 Dec 13 16:28 Makefile.am
-rw-r--r-- 1 root root 24933 Dec 13 16:28 Makefile.in
-rwxr-xr-x 1 root root 11419 Dec 13 16:28 missing
drwxr-xr-x 6 root root 135 Dec 13 16:28 mpi
drwxr-xr-x 2 root root 4096 Dec 13 16:28 openshmem
-rw-r--r-- 1 root root 46257 Dec 13 16:28 README
drwxr-xr-x 2 root root 4096 Dec 13 16:28 upc
drwxr-xr-x 2 root root 4096 Dec 13 16:28 upcxx
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# echo $LD_LIBRARY_PATH
[root@node001 OSU_Microbenchmarks]# ls /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
libcublasLt.so libcuda.so libcufftw.so libcusolverMg.so libcusparse.so libnppial.so libnppidei.so libnppig.so libnppist.so libnppitc.so libnvidia-ml.so libnvrtc.so
libcublas.so libcufft.so libcurand.so libcusolver.so libnppc.so libnppicc.so libnppif.so libnppim.so libnppisu.so libnpps.so libnvjpeg.so
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# export LD_LIBRARY_PATH=/cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# echo $LD_LIBRARY_PATH
/cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]#
[root@node001 OSU_Microbenchmarks]# ./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... /usr/lib64/openmpi/bin/mpicc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by /usr/lib64/openmpi/bin/mpicc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 3458764513820540925
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from /usr/lib64/openmpi/bin/mpicc object... ok
checking how to run the C preprocessor... /usr/lib64/openmpi/bin/mpicc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if /usr/lib64/openmpi/bin/mpicc supports -fno-rtti -fno-exceptions... no
checking for /usr/lib64/openmpi/bin/mpicc option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpicc PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpicc static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpicc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether we are using the GNU C compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... (cached) yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... (cached) none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... (cached) gcc3
checking whether we are using the GNU C++ compiler... yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... gcc3
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... (cached) yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... (cached) gcc3
checking how to run the C++ preprocessor... /usr/lib64/openmpi/bin/mpic++ -E
checking for ld used by /usr/lib64/openmpi/bin/mpic++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for /usr/lib64/openmpi/bin/mpic++ option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpic++ PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpic++ static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for library containing sqrt... -lm
checking for library containing pthread_join... none required
checking for library containing clock_gettime... none required
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for inline... inline
checking for getpagesize... yes
checking for gettimeofday... yes
checking for memset... yes
checking for sqrt... yes
checking for MPI_Init... yes
checking for MPI_Accumulate... yes
checking for MPI_Get_accumulate... yes
checking for shmem_barrier_all... no
checking for upc_memput... no
checking whether upcxx_alltoall is declared... no
checking for library containing cuPointerGetAttribute... no
configure: error: cannot link with -lcuda
Attaching config.log.
Could you please add the LDFLAGS=-Wl,--verbose option to ./configure and then attach config.log, like this:
./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ LDFLAGS=-Wl,--verbose --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
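With --verbose, GNU ld logs every path it tries as an "attempt to open ... failed" or "attempt to open ... succeeded" line, and those lines end up in config.log. A small filter like the following can narrow the log to the library in question; linker_attempts is a hypothetical helper, and the default of libcuda is simply the library that fails to link in this thread.

```bash
#!/usr/bin/env bash
# Show where the linker looked for a library, based on the
# "attempt to open <path> failed|succeeded" lines written by ld --verbose.
#   $1 = log file (for example config.log)
#   $2 = library name prefix to filter on (default: libcuda)
linker_attempts() {
    local log="$1" lib="${2:-libcuda}"
    grep -E "attempt to open .*${lib}[^ ]* (failed|succeeded)" "$log"
}

# Only run the demo when a config.log is present in the current directory.
if [ -f config.log ]; then
    linker_attempts config.log || true
fi
```

If every attempt for the library ends in "failed", none of the directories the linker searched contains it, which points at the search path rather than at the library itself.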
@Artemy-Mellanox, please find config.log attached.
Could you please add /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs to LIBRARY_PATH as well?
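The two variables serve different stages: GCC consults LIBRARY_PATH when it resolves -l options at link time, while LD_LIBRARY_PATH is read by the dynamic loader when a program starts. That would explain why setting only LD_LIBRARY_PATH did not help configure find -lcuda. The sketch below adds the stubs directory to LIBRARY_PATH without duplicating entries; prepend_path is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Prepend a directory to a colon-separated search-path variable,
# skipping the change if the directory is already listed.
#   $1 = variable name, $2 = directory
prepend_path() {
    local var="$1" dir="$2"
    local current="${!var:-}"           # current value of the named variable
    case ":$current:" in
        *":$dir:"*) ;;                  # already present, nothing to do
        *) export "$var=$dir${current:+:$current}" ;;
    esac
}

stubs=/cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs
prepend_path LIBRARY_PATH "$stubs"      # searched by gcc for -lcuda at link time
echo "LIBRARY_PATH=$LIBRARY_PATH"
```

The stub library only satisfies the linker; at run time the benchmark still needs the real libcuda from the NVIDIA driver installation.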
@Artemy-Mellanox, it looks like configure now passes, but make is failing. I set the PATH variable because nvcc was not found.
[root@node001 OSU_Microbenchmarks]# ./configure CC=/usr/lib64/openmpi/bin/mpicc CXX=/usr/lib64/openmpi/bin/mpic++ LDFLAGS=-Wl,--verbose --enable-cuda --with-cuda=/cm/shared/apps/cuda11.7/toolkit/11.7.1
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking build system type... x86_64-unknown-linux-gnu
checking host system type... x86_64-unknown-linux-gnu
checking for style of include used by make... GNU
checking for gcc... /usr/lib64/openmpi/bin/mpicc
checking whether the C compiler works... yes
checking for C compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... gcc3
checking for a sed that does not truncate output... /usr/bin/sed
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for fgrep... /usr/bin/grep -F
checking for ld used by /usr/lib64/openmpi/bin/mpicc... /usr/bin/ld
checking if the linker (/usr/bin/ld) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 3458764513820540925
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking for /usr/bin/ld option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... pass_all
checking for ar... ar
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from /usr/lib64/openmpi/bin/mpicc object... ok
checking how to run the C preprocessor... /usr/lib64/openmpi/bin/mpicc -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking for dlfcn.h... yes
checking for objdir... .libs
checking if /usr/lib64/openmpi/bin/mpicc supports -fno-rtti -fno-exceptions... no
checking for /usr/lib64/openmpi/bin/mpicc option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpicc PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpicc static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpicc supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpicc linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking whether -lc should be explicitly linked in... no
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking whether we are using the GNU C compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpicc accepts -g... (cached) yes
checking for /usr/lib64/openmpi/bin/mpicc option to accept ISO C89... (cached) none needed
checking dependency style of /usr/lib64/openmpi/bin/mpicc... (cached) gcc3
checking whether we are using the GNU C++ compiler... yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... gcc3
checking whether we are using the GNU C++ compiler... (cached) yes
checking whether /usr/lib64/openmpi/bin/mpic++ accepts -g... (cached) yes
checking dependency style of /usr/lib64/openmpi/bin/mpic++... (cached) gcc3
checking how to run the C++ preprocessor... /usr/lib64/openmpi/bin/mpic++ -E
checking for ld used by /usr/lib64/openmpi/bin/mpic++... /usr/bin/ld -m elf_x86_64
checking if the linker (/usr/bin/ld -m elf_x86_64) is GNU ld... yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking for /usr/lib64/openmpi/bin/mpic++ option to produce PIC... -fPIC -DPIC
checking if /usr/lib64/openmpi/bin/mpic++ PIC flag -fPIC -DPIC works... yes
checking if /usr/lib64/openmpi/bin/mpic++ static flag -static works... no
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... yes
checking if /usr/lib64/openmpi/bin/mpic++ supports -c -o file.o... (cached) yes
checking whether the /usr/lib64/openmpi/bin/mpic++ linker (/usr/bin/ld -m elf_x86_64) supports shared libraries... yes
checking dynamic linker characteristics... GNU/Linux ld.so
checking how to hardcode library paths into programs... immediate
checking for library containing sqrt... -lm
checking for library containing pthread_join... none required
checking for library containing clock_gettime... none required
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking sys/time.h usability... yes
checking sys/time.h presence... yes
checking for sys/time.h... yes
checking for unistd.h... (cached) yes
checking for inline... inline
checking for getpagesize... yes
checking for gettimeofday... yes
checking for memset... yes
checking for sqrt... yes
checking for MPI_Init... yes
checking for MPI_Accumulate... yes
checking for MPI_Get_accumulate... yes
checking for shmem_barrier_all... no
checking for upc_memput... no
checking whether upcxx_alltoall is declared... no
checking for library containing cuPointerGetAttribute... -lcuda
checking for library containing cudaFree... -lcudart
checking cuda.h usability... yes
checking cuda.h presence... yes
checking for cuda.h... yes
configure: creating ./config.status
config.status: creating Makefile
config.status: creating mpi/Makefile
config.status: creating mpi/pt2pt/Makefile
config.status: creating mpi/startup/Makefile
config.status: creating mpi/one-sided/Makefile
config.status: creating mpi/collective/Makefile
config.status: creating openshmem/Makefile
config.status: creating upc/Makefile
config.status: creating upcxx/Makefile
config.status: executing depfiles commands
config.status: executing libtool commands
make clean & make
[root@node001 OSU_Microbenchmarks]# echo $PATH
/cm/shared/apps/cuda11.7/toolkit/11.7.1/bin:/usr/lib64/openmpi/bin:/cm/local/apps/environment-modules/4.5.3//bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/cm/local/apps/environment-modules/4.5.3/bin:/opt/dell/srvadmin/bin:/opt/dell/srvadmin/sbin:/root/bin
attempt to open //usr/x86_64-redhat-linux/lib64/libevent_core-2.1.so.6 failed
found libevent_core-2.1.so.6 at //usr/lib64/libevent_core-2.1.so.6
libevent_pthreads-2.1.so.6 needed by /usr/lib64/openmpi/lib/libmpi.so
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/openmpi/lib/libevent_pthreads-2.1.so.6 failed
attempt to open /usr/lib64/openmpi/lib/libevent_pthreads-2.1.so.6 failed
attempt to open /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/openmpi/lib/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/atlas/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64//bind9-export/libevent_pthreads-2.1.so.6 failed
attempt to open //cm/local/apps/cuda/libs/current/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/lib64/dyninst/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-idrac7/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-isvc/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/smpop/libevent_pthreads-2.1.so.6 failed
attempt to open //opt/dell/srvadmin/lib64/libevent_pthreads-2.1.so.6 failed
attempt to open //usr/x86_64-redhat-linux/lib64/libevent_pthreads-2.1.so.6 failed
found libevent_pthreads-2.1.so.6 at //usr/lib64/libevent_pthreads-2.1.so.6
libcrypto.so.1.1 needed by //usr/lib64/libevent_core-2.1.so.6
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/libcrypto.so.1.1 failed
attempt to open //cm/shared/apps/cuda11.7/toolkit/11.7.1/lib/libcrypto.so.1.1 failed
attempt to open //usr/lib64/openmpi/lib/libcrypto.so.1.1 failed
attempt to open /usr/lib64/openmpi/lib/libcrypto.so.1.1 failed
attempt to open /cm/shared/apps/cuda11.7/toolkit/11.7.1/lib64/stubs/libcrypto.so.1.1 failed
attempt to open //usr/lib64/atlas/libcrypto.so.1.1 failed
attempt to open //usr/lib64//bind9-export/libcrypto.so.1.1 failed
attempt to open //cm/local/apps/cuda/libs/current/lib64/libcrypto.so.1.1 failed
attempt to open //usr/lib64/dyninst/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-idrac7/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/srvadmin-isvc/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/openmanage/smpop/libcrypto.so.1.1 failed
attempt to open //opt/dell/srvadmin/lib64/libcrypto.so.1.1 failed
attempt to open //usr/x86_64-redhat-linux/lib64/libcrypto.so.1.1 failed
found libcrypto.so.1.1 at //usr/lib64/libcrypto.so.1.1
make[2]: Leaving directory '/cm/shared/OSU_Microbenchmarks/mpi/one-sided'
make[2]: Entering directory '/cm/shared/OSU_Microbenchmarks/mpi'
make[2]: Nothing to be done for 'all-am'.
make[2]: Leaving directory '/cm/shared/OSU_Microbenchmarks/mpi'
make[1]: Leaving directory '/cm/shared/OSU_Microbenchmarks/mpi'
make[1]: Entering directory '/cm/shared/OSU_Microbenchmarks'
make[1]: Nothing to be done for 'all-am'.
make[1]: Leaving directory '/cm/shared/OSU_Microbenchmarks'
[root@node001 OSU_Microbenchmarks]# ll
total 2628
-rw-r--r-- 1 root root 316266 Dec 30 16:11 aclocal.m4
drwxr-xr-x 2 root root 70 Dec 30 16:11 autom4te.cache
-rw-r--r-- 1 root root 9579 Dec 13 16:28 CHANGES
-rwxr-xr-x 1 root root 44941 Dec 13 16:28 config.guess
-rw-r--r-- 1 root root 951459 Dec 30 16:15 config.log
-rwxr-xr-x 1 root root 67674 Dec 30 16:15 config.status
-rwxr-xr-x 1 root root 34423 Dec 13 16:28 config.sub
-rwxr-xr-x 1 root root 550027 Dec 30 16:11 configure
-rw-r--r-- 1 root root 6275 Dec 13 16:28 configure.ac
-rw-r--r-- 1 root root 2024 Dec 13 16:28 COPYRIGHT
-rwxr-xr-x 1 root root 18615 Dec 13 16:28 depcomp
-rwxr-xr-x 1 root root 66 Dec 13 16:28 get_local_rank
-rwxr-xr-x 1 root root 13663 Dec 13 16:28 install-sh
-rwxr-xr-x 1 root root 264446 Dec 30 16:15 libtool
-rwxr-xr-x 1 root root 243248 Dec 13 16:28 ltmain.sh
-rw-r--r-- 1 root root 26427 Dec 30 16:15 Makefile
-rw-r--r-- 1 root root 252 Dec 13 16:28 Makefile.am
-rw-r--r-- 1 root root 24933 Dec 30 16:11 Makefile.in
-rwxr-xr-x 1 root root 11419 Dec 13 16:28 missing
drwxr-xr-x 6 root root 155 Dec 30 16:15 mpi
drwxr-xr-x 3 root root 4096 Dec 30 16:15 openshmem
-rw-r--r-- 1 root root 46257 Dec 13 16:28 README
drwxr-xr-x 3 root root 4096 Dec 30 16:15 upc
drwxr-xr-x 3 root root 4096 Dec 30 16:15 upcxx
@Artemy-Mellanox I tried running the benchmark from the pt2pt directory:
[root@bright88 pt2pt]# pwd
/cm/shared/OSU_Microbenchmarks/mpi/pt2pt
[root@bright88 pt2pt]#
[root@bright88 pt2pt]#
[root@bright88 pt2pt]# ll
total 844
-rw-r--r-- 1 root root 16 Dec 30 17:15 hostfile
-rw-r--r-- 1 root root 19453 Dec 30 17:02 Makefile
-rw-r--r-- 1 root root 784 Dec 13 16:28 Makefile.am
-rw-r--r-- 1 root root 18817 Dec 30 16:11 Makefile.in
-rwxr-xr-x 1 root root 68584 Dec 30 16:51 osu_bibw
-rw-r--r-- 1 root root 4528 Dec 13 16:28 osu_bibw.c
-rw-r--r-- 1 root root 30032 Dec 30 16:51 osu_bibw.o
-rwxr-xr-x 1 root root 68584 Dec 30 16:51 osu_bw
-rw-r--r-- 1 root root 4111 Dec 13 16:28 osu_bw.c
-rw-r--r-- 1 root root 29640 Dec 30 16:51 osu_bw.o
-rwxr-xr-x 1 root root 68032 Dec 30 16:51 osu_latency
-rw-r--r-- 1 root root 3705 Dec 13 16:28 osu_latency.c
-rwxr-xr-x 1 root root 78936 Dec 30 16:51 osu_latency_mt
-rw-r--r-- 1 root root 6879 Dec 13 16:28 osu_latency_mt.c
-rw-r--r-- 1 root root 51384 Dec 30 16:51 osu_latency_mt.o
-rw-r--r-- 1 root root 28000 Dec 30 16:51 osu_latency.o
-rwxr-xr-x 1 root root 51136 Dec 30 16:51 osu_mbw_mr
-rw-r--r-- 1 root root 10684 Dec 13 16:28 osu_mbw_mr.c
-rw-r--r-- 1 root root 61392 Dec 30 16:51 osu_mbw_mr.o
-rwxr-xr-x 1 root root 70944 Dec 30 16:51 osu_multi_lat
-rw-r--r-- 1 root root 4757 Dec 13 16:28 osu_multi_lat.c
-rw-r--r-- 1 root root 43080 Dec 30 16:51 osu_multi_lat.o
-rw-r--r-- 1 root root 17041 Dec 13 16:28 osu_pt2pt.c
-rw-r--r-- 1 root root 2779 Dec 13 16:28 osu_pt2pt.h
-rw-r--r-- 1 root root 67336 Dec 30 16:51 osu_pt2pt.o
[root@bright88 pt2pt]#
[root@bright88 pt2pt]#
[root@bright88 pt2pt]#
[root@bright88 pt2pt]# mpirun -np 2 -hostfile hostfile MV2_USE_CUDA=1 get_local_rank ./osu_latency D D
bash: orted: command not found
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
--------------------------------------------------------------------------
ORTE does not know how to route a message to the specified daemon
located on the indicated node:
my node: bright88
target node: node002
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
--------------------------------------------------------------------------
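The "bash: orted: command not found" message above means the remote node could not find Open MPI's daemon on the PATH of its non-interactive shell. A minimal sketch of one common workaround follows, assuming Open MPI is installed under the same prefix on every node. The prefix used as a default is the one that appears later in this thread, and the helper name is hypothetical. The helper only prints the command so it can be inspected before running it.

```shell
#!/bin/sh
# Sketch: build an mpirun command line that lets remote nodes find orted.
# OMPI_PREFIX is an assumption; point it at the Open MPI install shared by all nodes.
OMPI_PREFIX="${OMPI_PREFIX:-/cm/shared/apps/openmpi4/gcc/4.1.2}"

# Print (do not run) an mpirun command that passes --prefix explicitly.
# Invoking mpirun by its absolute path has the same effect as --prefix.
build_mpirun_cmd() {
    np="$1"
    hostfile="$2"
    shift 2
    printf '%s --prefix %s -np %s -hostfile %s %s\n' \
        "${OMPI_PREFIX}/bin/mpirun" "${OMPI_PREFIX}" "${np}" "${hostfile}" "$*"
}

build_mpirun_cmd 2 hostfile ./osu_latency D D
```

Either form tells the remote shell where to find orted and its libraries without editing the login scripts on every node.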
@Artemy-Mellanox I managed to get further by calling mpirun with its full path:
[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun -np 2 -hostfile hostfile ./osu_bw D D
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: node001
Local device: mlx5_0
Local port: 1
CPCs attempted: rdmacm, udcm
--------------------------------------------------------------------------
# OSU MPI-CUDA Bandwidth Test v5.3.2
# Send Buffer on DEVICE (D) and Receive Buffer on DEVICE (D)
# Size Bandwidth (MB/s)
[node001:1935046] *** Process received signal ***
[node001:1935046] Signal: Segmentation fault (11)
[node001:1935046] Signal code: Invalid permissions (2)
[node001:1935046] Failing at address: 0x155523200000
[node001:1935046] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x155553341cf0]
[node001:1935046] [ 1] /lib64/libc.so.6(+0xd003c)[0x15555303903c]
[node001:1935046] [ 2] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/libopen-pal.so.40(opal_convertor_pack+0x1a8)[0x155552633808]
[node001:1935046] [ 3] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/openmpi/mca_btl_vader.so(mca_btl_vader_sendi+0x11c)[0x15550c4b3d6c]
[node001:1935046] [ 4] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/openmpi/mca_pml_ob1.so(+0xade4)[0x155506312de4]
[node001:1935046] [ 5] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_isend+0x4ff)[0x155506313a8f]
[node001:1935046] [ 6] /cm/shared/apps/openmpi4/gcc/4.1.2/lib/libmpi.so.40(MPI_Isend+0x125)[0x1555535d3045]
[node001:1935046] [ 7] ./osu_bw[0x40128b]
[node001:1935046] [ 8] /lib64/libc.so.6(__libc_start_main+0xe5)[0x155552fa3d85]
[node001:1935046] [ 9] ./osu_bw[0x4015ee]
[node001:1935046] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 1935046 on node node001 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[bright88:3782908] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
[bright88:3782908] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[root@bright88 pt2pt]#
[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun -np 2 -hostfile hostfile MV2_USE_CUDA=1 ./osu_bw D H
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: node001
Executable: MV2_USE_CUDA=1
--------------------------------------------------------------------------
2 total processes failed to start
[root@bright88 pt2pt]#
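The backtrace above fails in opal_convertor_pack under the vader (shared-memory) BTL while sending a device buffer. One possible explanation, which is a guess from the trace and not confirmed anywhere in this thread, is that this Open MPI build is not CUDA-aware. Open MPI reports this through the mpi_built_with_cuda_support parameter in ompi_info output. The sketch below reads that output on stdin and prints the value; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: extract the CUDA-support flag from `ompi_info --parsable --all` output.
# Reads ompi_info output on stdin; prints "true", "false", or "unknown".
cuda_support_flag() {
    value="$(grep 'mpi_built_with_cuda_support:value:' | head -n 1 | awk -F: '{print $NF}')"
    printf '%s\n' "${value:-unknown}"
}

# Typical use on a node with Open MPI installed:
#   ompi_info --parsable --all | cuda_support_flag
# Self-contained demonstration with a sample line in ompi_info's parsable format:
printf 'mca:mpi:base:param:mpi_built_with_cuda_support:value:false\n' | cuda_support_flag
```

Run it against the same Open MPI installation that mpirun comes from, since the thread shows more than one installed on these nodes.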
@karanveersingh5623 can you please add "-mca pml ucx -mca btl self --report-bindings" to the mpirun command?
@yosefe @Artemy-Mellanox Below is the output
[root@bright88 pt2pt]# /cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun --mca pml ucx --mca btl self --report-bindings -np 2 -hostfile hostfile MV2_USE_CUDA=1 ./osu_bw D H
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job. This error was first reported for process
rank 0; it may have occurred for other processes as well.
NOTE: A common cause for this error is misspelling a mpirun command
line parameter option (remember that mpirun interprets the first
unrecognized command line token as the executable).
Node: node001
Executable: MV2_USE_CUDA=1
--------------------------------------------------------------------------
2 total processes failed to start
[root@bright88 pt2pt]#
Please remove 'MV2_USE_CUDA=1' from the mpirun command line; mpirun is treating it as the executable.
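MV2_USE_CUDA is an MVAPICH2 setting, so Open MPI does not need it. When a variable does have to reach the ranks, Open MPI's mpirun accepts it as "-x VAR=value" rather than as a bare token. Below is a sketch that prints the command suggested above, optionally forwarding variables that way. The helper name is hypothetical and the default mpirun path is the one used earlier in the thread.

```shell
#!/bin/sh
# Sketch: print the mpirun command suggested in the thread, without the bare
# MV2_USE_CUDA=1 token. MPIRUN is an assumption; adjust it to your install.
MPIRUN="${MPIRUN:-/cm/shared/apps/openmpi4/gcc/4.1.2/bin/mpirun}"

# Each argument is a VAR=value pair that is forwarded to the ranks with -x.
suggested_cmd() {
    cmd="${MPIRUN} --mca pml ucx --mca btl self --report-bindings"
    for kv in "$@"; do
        cmd="${cmd} -x ${kv}"
    done
    printf '%s\n' "${cmd} -np 2 -hostfile hostfile ./osu_bw D H"
}

suggested_cmd                       # no extra environment
suggested_cmd UCX_LOG_LEVEL=info    # forward a UCX variable to every rank
```

The output is only printed, so it can be checked and then pasted into the shell on bright88.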
Describe the bug
I am trying to run a Docker container for data preprocessing, using the NVIDIA implementation of MLPerf CosmoFlow (below is the link). The MPI job runs a shell script inside the Docker container; it runs fine for the training folder but fails for the validation folder. Below are the script details [init_datasets.sh]. Please let me know if you need more info.
Error msg when running container using srun
Steps to Reproduce
Command line
UCX version used (from github branch XX or release YY) + UCX configure flags (can be checked by ucx_info -v)
Any UCX environment variables used
Setup and versions
OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a
[root@bright88 burst-buffer]# cat /etc/centos-release
CentOS Linux release 7.9.2009 (Core)
[root@bright88 burst-buffer]# uname -r
3.10.0-1160.11.1.el7.x86_64
For Nvidia Bluefield SmartNIC include cat /etc/mlnx-release (the string identifies software and firmware setup)
rpm -q rdma-core or rpm -q libibverbs
ofed_info -s
Just using the Mellanox ConnectX-5 TCP stack on all servers
ibstat or ibv_devinfo -vv command
lsmod|grep nv_peer_mem and/or gdrcopy: lsmod|grep gdrdrv
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
[root@bright88 ~]# mpiexec --version
mpiexec (OpenRTE) 4.1.2
Report bugs to http://www.open-mpi.org/community/help/
[root@bright88 ~]# mpirun --version
mpirun (Open MPI) 4.1.2