lambdal / deeplearning-benchmark

Benchmark Suite for Deep Learning

Benchmark freezes when running on 4x H100 #21

Open liaozensiang opened 1 year ago

liaozensiang commented 1 year ago

Hi, I followed the instructions and tried to run the benchmark on a server.

First, I prepared the datasets with:

docker run --gpus all --rm --shm-size=64g -v ~/DeepLearningExamples/PyTorch:/workspace/benchmark -v ~/data:/data -v $(pwd)"/scripts":/scripts nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash -c "cp -r /scripts/* /workspace;  ./run_prepare.sh"

Then I ran the benchmark with:

docker run --rm --ipc=host --gpus all -v ~/DeepLearningExamples/PyTorch:/workspace/benchmark -v ~/data:/data -v $(pwd)"/scripts":/scripts -v $(pwd)"/results":/results nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash -c "cp -r /scripts/* /workspace; ./run_benchmark.sh 4xH100_80GB_PCIe5_v1 all 1500"

Output:

=============
== PyTorch ==
=============

NVIDIA Release 22.10 (build 46164382)
PyTorch Version 1.13.0a0+d0d6b1f

Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU                      (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015      Google Inc.
Copyright (c) 2015      Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting termcolor
  Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Installing collected packages: termcolor
Successfully installed termcolor-2.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/NVIDIA/dllogger
  Running command git clone -q https://github.com/NVIDIA/dllogger /tmp/pip-req-build-i67zm7ds
  Cloning https://github.com/NVIDIA/dllogger to /tmp/pip-req-build-i67zm7ds
  Resolved https://github.com/NVIDIA/dllogger to commit 0540a43971f4a8a16693a9de9de73c1072020769
Building wheels for collected packages: DLLogger
  Building wheel for DLLogger (setup.py): started
  Building wheel for DLLogger (setup.py): finished with status 'done'
  Created wheel for DLLogger: filename=DLLogger-1.0.0-py3-none-any.whl size=5670 sha256=05d7f11515a88b1de48e64ca8fbe8f6accb8f859d60aca70688ab9b13efeb405
  Stored in directory: /tmp/pip-ephem-wheel-cache-s60971xq/wheels/ad/94/cf/8f3396cb8d62d532695ec557e193fada55cd366e14fd9a02be
Successfully built DLLogger
Installing collected packages: DLLogger
Successfully installed DLLogger-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
/workspace /workspace
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
/workspace
4xH100_80GB_PCIe5_v1

PyTorch_ncf_FP32 started: 
/workspace /workspace
************************************************************
--data /data/ncf/cache/ml-20m --epochs 2 --batch_size 32000000
************************************************************
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
DLL 2023-10-13 09:27:01.015101 - PARAMETER data : /data/ncf/cache/ml-20m  feature_spec_file : feature_spec.yaml  epochs : 2  batch_size : 32000000  valid_batch_size : 1048576  factors : 64  layers : [256, 256, 128, 64]  negative_samples : 4  learning_rate : 0.0045  topk : 10  seed : None  threshold : 1.0  beta1 : 0.25  beta2 : 0.5  eps : 1e-08  dropout : 0.5  checkpoint_dir :   load_checkpoint_path : None  mode : train  grads_accumulated : 1  amp : False  log_path : log.json  world_size : 4  distributed : True  local_rank : 0 
DistributedDataParallel(
  (module): NeuMF(
    (mf_user_embed): Embedding(138493, 64)
    (mf_item_embed): Embedding(26744, 64)
    (mlp_user_embed): Embedding(138493, 128)
    (mlp_item_embed): Embedding(26744, 128)
    (mlp): ModuleList(
      (0): Linear(in_features=256, out_features=256, bias=True)
      (1): Linear(in_features=256, out_features=128, bias=True)
      (2): Linear(in_features=128, out_features=64, bias=True)
    )
    (final): Linear(in_features=128, out_features=1, bias=True)
  )
)
31832577 parameters

Meanwhile, nvidia-smi looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.07    Driver Version: 520.61.07    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 PCIe    On   | 00000000:4F:00.0 Off |                    0 |
| N/A   29C    P0    72W / 350W |   3881MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe    On   | 00000000:52:00.0 Off |                    0 |
| N/A   32C    P0    69W / 350W |    661MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 PCIe    On   | 00000000:56:00.0 Off |                    0 |
| N/A   31C    P0    72W / 350W |    661MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 PCIe    On   | 00000000:57:00.0 Off |                    0 |
| N/A   31C    P0    71W / 350W |    621MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14485      C   /opt/conda/bin/python            3878MiB |
|    1   N/A  N/A     14486      C   /opt/conda/bin/python             658MiB |
|    2   N/A  N/A     14487      C   /opt/conda/bin/python             658MiB |
|    3   N/A  N/A     14488      C   /opt/conda/bin/python             618MiB |
+-----------------------------------------------------------------------------+

I can't see any errors in the dmesg output either. Does anyone have a clue?
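
For reference, this is the kind of check I can run next to see where it is stuck (just a rough sketch: py-spy is not part of the benchmark and would have to be installed separately, the PID is whichever python worker nvidia-smi lists, and NCCL_DEBUG is generic NCCL logging rather than anything benchmark-specific):

# Attach to one of the apparently stuck ranks and dump its Python stack
pip install py-spy
py-spy dump --pid 14485   # PID of the rank-0 python process from nvidia-smi above

# Or rerun with NCCL logging enabled to see whether the hang is inside a collective
export NCCL_DEBUG=INFO
./run_benchmark.sh 4xH100_80GB_PCIe5_v1 all 1500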

goss34 commented 1 year ago

Have you completed the steps that involve ./setup.sh $NAME_NGC yet?
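
If not, that step looks roughly like this (a sketch based on the repo's README; the NGC tag is assumed to match the 22.10-py3 container used above, and what setup.sh does internally is paraphrased, not quoted):

export NAME_NGC=pytorch:22.10-py3
docker pull nvcr.io/nvidia/${NAME_NGC}
# setup.sh is supposed to fetch and patch the DeepLearningExamples code
# that gets mounted at /workspace/benchmark
./setup.sh $NAME_NGC

Skipping it would leave /workspace/benchmark without the benchmark code, which would also be consistent with the "Neither 'setup.py' nor 'pyproject.toml' found" error earlier in your log.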