mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Only 1 GPU working while training "Recommendation" #567

Closed longerzone closed 1 year ago

longerzone commented 2 years ago

As the title says, I'm running the recommendation training benchmark. Here is my device info from a bash shell inside the Docker container:

host:/workspace/recommendation# python
Python 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34)
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'torch' is not defined
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.cuda.device_count()
8
>>> torch.cuda.get_device_name(0)
'Tesla V100-SXM2-32GB'
>>>

However, I observed that only one GPU is active during the DLRM training stage (only GPU 0 is working for the whole training run): [image]

I also found no parameter for controlling the number of GPUs in the run_and_time.sh or ncf.py scripts. Do you have any suggestions?
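
On controlling the number of GPUs: independent of any script flag, the CUDA runtime honours the `CUDA_VISIBLE_DEVICES` environment variable, which limits the devices a process can see. The session above reports 8 devices, so visibility does not appear to be restricted here, and exposing more GPUs does not by itself make a single-GPU script use them. The following is only a minimal sketch for ruling that variable out. The helper name `visible_gpu_ids` is hypothetical, and it assumes the variable holds plain integer indices (GPU UUIDs, which CUDA also accepts, are not handled).

```python
import os


def visible_gpu_ids(total_gpus, env=None):
    """Return the physical GPU indices a CUDA process would be allowed to see.

    Hypothetical helper for illustration. Assumes CUDA_VISIBLE_DEVICES holds
    comma-separated integer indices only.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable unset: every installed GPU is visible.
        return list(range(total_gpus))

    ids = []
    for token in raw.split(","):
        token = token.strip()
        # CUDA stops at the first invalid entry; only the indices listed
        # before it remain visible. An empty value hides all devices.
        if not token.isdigit() or int(token) >= total_gpus:
            break
        ids.append(int(token))
    return ids


if __name__ == "__main__":
    # 8 matches the device count reported in the session above.
    print(visible_gpu_ids(8))
```

If this prints all eight indices inside the container, the single-GPU behaviour comes from the training script itself rather than from the environment.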

johntran-nv commented 1 year ago

Hi @longerzone, it looks like you're running the older NCF benchmark. We're on DLRM now (and are actually working on a new version of that as well). Could you try top-of-tree (TOT) DLRM instead? Feel free to open another issue if you have problems with DLRM. Thanks.