microsoft / superbenchmark

A validation and profiling tool for AI infrastructure
https://aka.ms/superbench
MIT License

Found no NVIDIA driver on your system. #498

Closed edisonchan closed 1 year ago

edisonchan commented 1 year ago

What's the issue, what's expected?: The torch inside the Docker container cannot find my GPU.

How to reproduce it?: Install superbenchmark as normal.

sb run -f local.ini -c resnet.yaml --host-password=mypassword

GPU: Quadro RTX 6000, driver 530.30.02 (nvidia-smi screenshot attached)

Log message or snapshot?: /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system.

Additional information: OS: Ubuntu 22.04.2

How can I fix this problem?

abuccts commented 1 year ago

Can you share your "sb deploy" command as well? Which Docker image did you use?

edisonchan commented 1 year ago

Can you share your "sb deploy" command as well? Which Docker image did you use?

I just did a fresh re-install of Ubuntu 22.04 and installed Docker with NVIDIA GPU support following this link: https://linuxhint.com/use-nvidia-gpu-docker-containers-ubuntu-22-04-lts/. The deploy steps are almost the same as here: https://microsoft.github.io/superbenchmark/docs/getting-started/run-superbench (via SSH):

set +H
sb deploy -f local.ini --host-password=mysshpassword
sb deploy -f local.ini -i superbench/superbench:v0.7.0-cuda11.8 --host-password=mysshpassword
sb run -f local.ini -c resnet.yaml --host-password=mysshpassword
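
As a quick sanity check (not from the original thread, just a common way to confirm the NVIDIA Container Toolkit is wired up), a plain CUDA container should be able to see the GPU; the image tag below is only an example:

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi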

The GPU works some of the time (according to the nvidia-smi report), but sb still says "No CUDA GPUs are available" some of the time:

localhost | CHANGED | rc=0 >>
[2023-03-27 03:38:27,121 u22:2171][executor.py:246][INFO] Executor is going to execute matmul.
[2023-03-27 03:38:28,959 u22:2171][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-matmul, message: No CUDA GPUs are available
[2023-03-27 03:38:28,960 u22:2171][executor.py:131][INFO] benchmark: pytorch-matmul, return code: 4, result: {'return_code': [4]}.
[2023-03-27 03:38:28,960 u22:2171][executor.py:138][ERROR] Executor failed in matmul.
... by setting deprecation_warnings=False in ansible.cfg.

abuccts commented 1 year ago

The GPU works some of the time (according to the nvidia-smi report), but sb still says "No CUDA GPUs are available" some of the time

Does "sometime" mean nvidia-smi sometimes can show GPU while sometimes cannot?

Can you also share the full log and the resnet.yaml you used? The filename says resnet, but the error message seems to come from the pytorch-matmul benchmark.

For the no-GPU error, it could be either a hardware setup issue or a config issue. Because your nvidia-smi shows only one GPU in the node, please make sure you have changed all proc_num values from 8 to 1 in the config file; by default the config targets a node with 8 GPUs. https://github.com/microsoft/superbenchmark/blob/97c9a41f147b71d3fbe87703af0e892e65f91224/superbench/config/default.yaml#L14
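
For illustration only (not part of the original reply), assuming the config in use is a local copy named resnet.yaml, the proc_num occurrences can be located and rewritten from the shell:

grep -n "proc_num" resnet.yaml
sed -i 's/proc_num: 8/proc_num: 1/g' resnet.yaml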

cp5555 commented 1 year ago

@edisonchan would you please check whether you still have the "No CUDA GPUs" issue or not? If not, we will close this issue.

edisonchan commented 1 year ago

@cp5555 The GPU-not-found problem is gone after changing "proc_num" to 1. There are still some other problems; I will open another issue if I cannot solve them.