zengqi0730 closed this issue 1 year ago
Hi @zengqi0730, could you add --gpus all to the docker run command and check if the error still occurs?
The same error occurs! I suspect it is because the latest 530 driver, which corresponds to CUDA 12.1, was not used; but the latest release version on the official website is only 525.
Could this be a 22.04.2 issue? I solved a similar issue where nvidia-driver-525
had to be recompiled against the recent Ubuntu 22.04.2 kernel updates even for the same driver version. Also, maybe try upgrading to 530?
Same issue for 23.06.
Use the correct Docker container version for your Ubuntu and CUDA versions: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-23-06.html#rel-23-06
I also encountered the same problem; in my case the cause was that my machine had suddenly shut down. My solution was to first stop the docker container, then restart the machine, and finally create a new docker container. It works for me.
@dangvansam Is it solved?
I got the same error with the TF docker image nvcr.io/nvidia/tensorflow:23.07-tf2-py3, and I already use the latest NVIDIA driver, 535.86.10.
Yes, it works for me.
can you help elaborate your fix?
tl;dr: sudo docker run --privileged --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/host/path/to/models:/models nvcr.io/nvidia/tritonserver:23.07-py3 tritonserver --model-repository=/models
Adding the --privileged flag allows the failing mmap operation to complete.
Troubleshooting journey:
I used a Python script that let me invoke cuInit directly from Python. This narrowed the error down to: RuntimeError: cuInit failed with error code 304: OS call failed or operation not supported on this OS
For some reason the server is unable to pin memory, which triggers the OS call failed error. Apparently, this is occasionally solved by reboot, but I've had no such luck so I did some more digging on the particular failure coming from mmap (operation not permitted), which led me to https://stackoverflow.com/questions/8213671/mmap-operation-not-permitted.
I checked my kernel configuration and had CONFIG_STRICT_DEVMEM=y, but I also wasn't seeing the same errors outside of docker, which made me think it was something related to the privileges docker was running with. I tried running with docker run --privileged and it worked!
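For anyone who wants to reproduce the cuInit probe described above without installing extra packages, a minimal ctypes sketch along these lines should work (the function name cu_init_status is illustrative, not from the original script; the 304 mapping is CUDA_ERROR_OPERATING_SYSTEM in the CUDA driver API):

```python
import ctypes

def cu_init_status():
    """Call cuInit(0) directly through the CUDA driver library.

    Returns the raw CUresult code (0 == CUDA_SUCCESS,
    304 == CUDA_ERROR_OPERATING_SYSTEM, the error seen in this thread),
    or None if libcuda.so.1 cannot be loaded at all.
    """
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not present on this machine
    return libcuda.cuInit(0)

if __name__ == "__main__":
    status = cu_init_status()
    if status is None:
        print("libcuda.so.1 not found; is the NVIDIA driver installed?")
    else:
        print(f"cuInit returned {status}")
```

Running this inside and outside the container helps tell apart a driver problem (fails in both) from a container-privilege problem (fails only inside).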
This is a fine solution for development, but given what --privileged allows, it is a non-starter for production (I'm not sure whether the NVIDIA folks expect that as a use case).
Same, I solved this problem by restarting the machine: sudo reboot
It's said that 99% of problems can be solved by restarting the machine. :)
Thank you all for your solutions and contributions! Closing due to inactivity.
I got the same issue; adding --privileged to my docker command solved it.
Final docker command:
docker run --privileged --gpus all -it --rm \
--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v /home/llama.cpp:/root/git/llama.cpp \
nvcr.io/nvidia/pytorch:24.01-py3
Same here. I hope a workaround can be found before entering production.
sudo docker run --privileged --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
My docker command has '--gpus all' already, but I still see the problem. Please suggest what to do.
Same issue for 24.04. Please help
I have the same problem running docker rootless; users do not have the right to use the --ulimit flag. This is my command:
docker run --rm --gpus all --shm-size=32gb --privileged nvcr.io/nvidia/pytorch:23.05-py3 nvidia-smi
gets me:
NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0
...
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ Unknown error (error 999) ]]
Any ideas?
Hello all,
I am facing the same issue. I tried rebooting and the --privileged option, but I'm getting the same error:
docker run --privileged --gpus all -it --rm \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
nvcr.io/nvidia/tritonserver:24.09-py3 tritonserver --model-repository=/models
NVIDIA Release 24.09 (build 112408254)
Triton Server Version 2.50.0
Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available.
[[ No CUDA-capable device is detected (error 100) ]]
This is my driver:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A30 Off | 00000000:0A:00.0 Off | On |
| N/A 29C P0 31W / 165W | 1MiB / 24576MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
Similar issue, but the error is:
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ Named symbol not found (error 500) ]]
I use the NGC container tritonserver:23.05-py3. Running the start command docker run --rm -it nvcr.io/nvidia/tritonserver:23.05-py3 bash
shows: ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ OS call failed or operation not supported on this OS (error 304) ]]
I tested the docker image on a V100 and an RTX 2080 Ti, with the driver updated to 525, and the same situation occurs on both. Besides, I found someone in the same situation as me on the forum, but there was no official response: https://forums.developer.nvidia.com/t/cuda-driver-initialization-failed/255858