Hi, I followed the instructions and tried to run the benchmark on a server with the following specs:
CPU: Intel(R) Xeon(R) Platinum 8352V CPU @ 2.10GHz
RAM: 128G
GPU: 4xH100 80GB PCIe
GPU Driver version: 520.61.07
I prepared the datasets with:
docker run --gpus all --rm --shm-size=64g -v ~/DeepLearningExamples/PyTorch:/workspace/benchmark -v ~/data:/data -v $(pwd)"/scripts":/scripts nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash -c "cp -r /scripts/* /workspace; ./run_prepare.sh"
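For reference, the NCF cache that this prepare step produces should end up under ~/data on the host (the container's /data is the ~/data mount, and the ml-20m path below is the one the benchmark's --data argument points at in the output further down), so it can be checked with a plain ls:

# host-side check that run_prepare.sh left the NCF cache where the benchmark expects it
ls -lh ~/data/ncf/cache/ml-20m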
Then I ran the benchmark with:
docker run --rm --ipc=host --gpus all -v ~/DeepLearningExamples/PyTorch:/workspace/benchmark -v ~/data:/data -v $(pwd)"/scripts":/scripts -v $(pwd)"/results":/results nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash -c "cp -r /scripts/* /workspace; ./run_benchmark.sh 4xH100_80GB_PCIe5_v1 all 1500"
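As a debugging convenience (not something from the repo's instructions), the same image can also be started interactively with identical mounts, to rerun a single workload by hand instead of the whole suite:

# same mounts as the benchmark command above, but with an interactive shell
docker run --rm -it --ipc=host --gpus all -v ~/DeepLearningExamples/PyTorch:/workspace/benchmark -v ~/data:/data -v $(pwd)"/scripts":/scripts -v $(pwd)"/results":/results nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash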
Output of the benchmark run:
=============
== PyTorch ==
=============
NVIDIA Release 22.10 (build 46164382)
PyTorch Version 1.13.0a0+d0d6b1f
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting termcolor
Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Installing collected packages: termcolor
Successfully installed termcolor-2.3.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/NVIDIA/dllogger
Running command git clone -q https://github.com/NVIDIA/dllogger /tmp/pip-req-build-i67zm7ds
Cloning https://github.com/NVIDIA/dllogger to /tmp/pip-req-build-i67zm7ds
Resolved https://github.com/NVIDIA/dllogger to commit 0540a43971f4a8a16693a9de9de73c1072020769
Building wheels for collected packages: DLLogger
Building wheel for DLLogger (setup.py): started
Building wheel for DLLogger (setup.py): finished with status 'done'
Created wheel for DLLogger: filename=DLLogger-1.0.0-py3-none-any.whl size=5670 sha256=05d7f11515a88b1de48e64ca8fbe8f6accb8f859d60aca70688ab9b13efeb405
Stored in directory: /tmp/pip-ephem-wheel-cache-s60971xq/wheels/ad/94/cf/8f3396cb8d62d532695ec557e193fada55cd366e14fd9a02be
Successfully built DLLogger
Installing collected packages: DLLogger
Successfully installed DLLogger-1.0.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
/workspace /workspace
ERROR: Directory '.' is not installable. Neither 'setup.py' nor 'pyproject.toml' found.
/workspace
4xH100_80GB_PCIe5_v1
PyTorch_ncf_FP32 started:
/workspace /workspace
************************************************************
--data /data/ncf/cache/ml-20m --epochs 2 --batch_size 32000000
************************************************************
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
DLL 2023-10-13 09:27:01.015101 - PARAMETER data : /data/ncf/cache/ml-20m feature_spec_file : feature_spec.yaml epochs : 2 batch_size : 32000000 valid_batch_size : 1048576 factors : 64 layers : [256, 256, 128, 64] negative_samples : 4 learning_rate : 0.0045 topk : 10 seed : None threshold : 1.0 beta1 : 0.25 beta2 : 0.5 eps : 1e-08 dropout : 0.5 checkpoint_dir : load_checkpoint_path : None mode : train grads_accumulated : 1 amp : False log_path : log.json world_size : 4 distributed : True local_rank : 0
DistributedDataParallel(
(module): NeuMF(
(mf_user_embed): Embedding(138493, 64)
(mf_item_embed): Embedding(26744, 64)
(mlp_user_embed): Embedding(138493, 128)
(mlp_item_embed): Embedding(26744, 128)
(mlp): ModuleList(
(0): Linear(in_features=256, out_features=256, bias=True)
(1): Linear(in_features=256, out_features=128, bias=True)
(2): Linear(in_features=128, out_features=64, bias=True)
)
(final): Linear(in_features=128, out_features=1, bias=True)
)
)
31832577 parameters
Meanwhile, nvidia-smi looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.07    Driver Version: 520.61.07    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA H100 PCIe    On   | 00000000:4F:00.0 Off |                    0 |
| N/A   29C    P0    72W / 350W |   3881MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA H100 PCIe    On   | 00000000:52:00.0 Off |                    0 |
| N/A   32C    P0    69W / 350W |    661MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA H100 PCIe    On   | 00000000:56:00.0 Off |                    0 |
| N/A   31C    P0    72W / 350W |    661MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA H100 PCIe    On   | 00000000:57:00.0 Off |                    0 |
| N/A   31C    P0    71W / 350W |    621MiB / 81559MiB |    100%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     14485      C   /opt/conda/bin/python            3878MiB |
|    1   N/A  N/A     14486      C   /opt/conda/bin/python             658MiB |
|    2   N/A  N/A     14487      C   /opt/conda/bin/python             658MiB |
|    3   N/A  N/A     14488      C   /opt/conda/bin/python             618MiB |
+-----------------------------------------------------------------------------+
I can't see any errors in the dmesg output either.
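In case it helps narrow this down, these are the extra host-side checks I can run next; everything below is plain nvidia-smi / lspci / docker usage rather than anything specific to the benchmark scripts:

# GPU-to-GPU interconnect and NUMA topology of the four PCIe cards
nvidia-smi topo -m

# whether PCIe ACS is enabled on the bridges (ACS can interfere with GPU peer-to-peer traffic)
sudo lspci -vvv | grep -i acsctl

# same benchmark command as above, with NCCL tracing enabled (NCCL_DEBUG is the standard NCCL environment variable, passed into the container via docker's -e flag)
docker run --rm --ipc=host --gpus all -e NCCL_DEBUG=INFO -v ~/DeepLearningExamples/PyTorch:/workspace/benchmark -v ~/data:/data -v $(pwd)"/scripts":/scripts -v $(pwd)"/results":/results nvcr.io/nvidia/pytorch:22.10-py3 /bin/bash -c "cp -r /scripts/* /workspace; ./run_benchmark.sh 4xH100_80GB_PCIe5_v1 all 1500"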
Does anyone have a clue?