Closed edisonchan closed 1 year ago
Can you share your "sb deploy" command as well? Which Docker image did you use?
I just reinstalled Ubuntu 22.04 from scratch and installed Docker according to this link: https://linuxhint.com/use-nvidia-gpu-docker-containers-ubuntu-22-04-lts/. The deploy steps are almost the same as here: https://microsoft.github.io/superbenchmark/docs/getting-started/run-superbench (via SSH):
set +H
sb deploy -f local.ini --host-password=mysshpassword
sb deploy -f local.ini -i superbench/superbench:v0.7.0-cuda11.8 --host-password=mysshpassword
sb run -f local.ini -c resnet.yaml --host-password=mysshpassword
The GPU works sometimes (according to the nvidia-smi report), but sb still sometimes says No CUDA GPUs are available:
localhost | CHANGED | rc=0 >>
[2023-03-27 03:38:27,121 u22:2171][executor.py:246][INFO] Executor is going to execute matmul.
[2023-03-27 03:38:28,959 u22:2171][base.py:179][ERROR] Run benchmark failed - benchmark: pytorch-matmul, message: No CUDA GPUs are available
[2023-03-27 03:38:28,960 u22:2171][executor.py:131][INFO] benchmark: pytorch-matmul, return code: 4, result: {'return_code': [4]}.
[2023-03-27 03:38:28,960 u22:2171][executor.py:138][ERROR] Executor failed in matmul.
by setting deprecation_warnings=False in ansible.cfg.
The GPU works sometimes (according to the nvidia-smi report), but sb still sometimes says No CUDA GPUs are available
Does "sometimes" mean that nvidia-smi sometimes shows the GPU and sometimes does not?
Can you also share the full log and the resnet.yaml you used? The filename says resnet, but the error message appears to come from the pytorch-matmul benchmark.
For the no-GPU error, it could be either a hardware setup issue or a config issue. Because your nvidia-smi output shows only one GPU in the node, please make sure you have changed every proc_num from 8 to 1 in the config file. By default the config targets a node with 8 GPUs: https://github.com/microsoft/superbenchmark/blob/97c9a41f147b71d3fbe87703af0e892e65f91224/superbench/config/default.yaml#L14
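The proc_num change suggested above can be scripted. The following is a minimal Python sketch, not part of SuperBench: the helper names `count_gpus` and `set_proc_num` and the `SAMPLE` fragment are made up for illustration, and the fragment only imitates the shape of a config with `proc_num` entries. It assumes `nvidia-smi -L` prints one line per GPU starting with "GPU ".

```python
import re
import subprocess


def count_gpus():
    """Count GPUs by counting the 'GPU N: ...' lines printed by `nvidia-smi -L`.

    Returns 0 if nvidia-smi is missing or exits with an error.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return 0
    return sum(1 for line in out.splitlines() if line.startswith("GPU "))


def set_proc_num(config_text, gpu_count):
    """Return config_text with every `proc_num: <int>` value replaced by gpu_count."""
    return re.sub(
        r"(\bproc_num:\s*)\d+",
        lambda m: f"{m.group(1)}{gpu_count}",
        config_text,
    )


# Illustrative fragment only (an assumption, not copied from default.yaml).
SAMPLE = """\
modes:
  - name: local
    proc_num: 8
  - name: torch.distributed
    proc_num: 8
    node_num: 1
"""

if __name__ == "__main__":
    # On a single-GPU node this prints the fragment with proc_num set to 1.
    print(set_proc_num(SAMPLE, count_gpus() or 1))
```

A plain text substitution is used here so that comments and YAML anchors in the config file are left untouched; review the result before passing it to `sb run -c`.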
@edisonchan would you please check whether you still have the No CUDA GPUs issue? If not, we will close this issue.
@cp5555 The GPU-not-found problem is gone after changing "proc_num" to 1. There are still some problems here; I will open another issue if I cannot solve them.
What's the issue, what's expected?: The torch inside the Docker container cannot find my GPU.
How to reproduce it?: Install superbenchmark as normal.
sb run -f local.ini -c resnet.yaml --host-password=mypassword
GPU: Quadro RTX 6000, driver 530.30.02. nvidia-smi output: ![image](https://user-images.githubusercontent.com/8596506/227720179-6664a5e6-8b91-44a4-af16-d3097c02b4ef.png)
Log message or snapshot?: /opt/conda/lib/python3.8/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system.
Additional information: OS: Ubuntu 22.04.02
How can I fix this problem?
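One way to narrow this down is to compare what the driver tooling and PyTorch each report from the same environment, for example inside the container. The sketch below is a hypothetical diagnostic, not a SuperBench tool: the function name `cuda_report` and the keys of the dict it returns are made up for illustration. It only uses `nvidia-smi -L`, `torch.cuda.is_available()`, and `torch.cuda.device_count()`, and it degrades gracefully if either nvidia-smi or torch is missing.

```python
import shutil
import subprocess


def cuda_report():
    """Collect a small report on GPU visibility from the current environment."""
    report = {
        "nvidia_smi_ok": False,    # nvidia-smi exists and exits with code 0
        "torch_installed": False,  # torch can be imported
        "cuda_available": False,   # torch.cuda.is_available()
        "device_count": 0,         # torch.cuda.device_count()
    }

    smi = shutil.which("nvidia-smi")
    if smi is not None:
        try:
            result = subprocess.run([smi, "-L"], capture_output=True, timeout=30)
            report["nvidia_smi_ok"] = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            pass

    try:
        import torch
    except ImportError:
        return report

    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    report["device_count"] = int(torch.cuda.device_count())
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `nvidia_smi_ok` is true on the host but false inside the container, the container probably cannot see the driver at all, which would point at the Docker GPU setup. If nvidia-smi works in the container but `device_count` is 0, the cause is more likely on the PyTorch or configuration side.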