triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. #5931

Closed zengqi0730 closed 1 year ago

zengqi0730 commented 1 year ago

I use the NGC container tritonserver:23.05-py3, running the start command: docker run --rm -it nvcr.io/nvidia/tritonserver:23.05-py3 bash

but it shows the error: ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ OS call failed or operation not supported on this OS (error 304) ]]

Some additional information is shown in the attached screenshot (2023-06-12_14-28-30).

I tested the docker image on a V100 and an RTX 2080 Ti, with the driver updated to 525; the same error occurs on both. I also found someone on the forum in the same situation as me, but there was no official response: https://forums.developer.nvidia.com/t/cuda-driver-initialization-failed/255858

krishung5 commented 1 year ago

Hi @zengqi0730, could you add --gpus all to the docker run command and check if the error still occurs?

zengqi0730 commented 1 year ago

> Hi @zengqi0730, could you add --gpus all to the docker run command and check if the error still occurs?

The same error occurs! I suspect it may be because I am not using the 530 driver corresponding to CUDA 12.1, but the latest release on the official website is only 525. (screenshot: 2023-06-13_09-49-03)

zhanwenchen commented 1 year ago

Could this be a 22.04.2 issue? I solved a similar issue where nvidia-driver-525 had to be recompiled against the recent Ubuntu 22.04.2 kernel updates even for the same driver version. Also, maybe try upgrading to 530?

yuekaizhang commented 1 year ago

Same issue for 23.06.

dangvansam commented 1 year ago

Use the container version that matches your Ubuntu and CUDA versions: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/rel-23-06.html#rel-23-06
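The release notes linked above map each container release to a minimum host driver version. As a quick sanity check before launching a container, the installed kernel driver version can be read from /proc on the host. This is a minimal sketch, assuming a Linux host; the /proc path is standard for the NVIDIA driver, but the parsing below is my own and not something from this thread:

```python
from pathlib import Path

def host_driver_version():
    """Return the NVIDIA kernel driver version string, or None if no driver is loaded."""
    proc = Path("/proc/driver/nvidia/version")
    if not proc.exists():
        return None  # driver kernel module is not loaded on this host
    # The first line looks like:
    # "NVRM version: NVIDIA UNIX x86_64 Kernel Module  535.86.10  Release Build ..."
    # so the version is the first token that starts with a digit and contains a dot.
    for token in proc.read_text().split():
        if token[0].isdigit() and "." in token:
            return token
    return None

if __name__ == "__main__":
    version = host_driver_version()
    if version is None:
        print("No NVIDIA kernel driver found on this host")
    else:
        print(f"Host driver version: {version}")
```

Comparing the reported version against the table in the release notes tells you whether the container/driver pairing is supported at all, before chasing CUDA init errors inside the container.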

Sencc commented 1 year ago

I also encountered the same problem; in my case it was caused by my machine suddenly shutting down. My solution was to first stop the docker container, then restart the machine, and finally create a new docker container. It works for me.

coderchem commented 1 year ago

@dangvansam Is it solved?

bigpieit commented 1 year ago

I got the same error with the TF docker image nvcr.io/nvidia/tensorflow:23.07-tf2-py3, and I am already using the latest NVIDIA driver, 535.86.10.

dangvansam commented 1 year ago

> @dangvansam Is it solved?

Yes, it works for me.

bigpieit commented 1 year ago

> > @dangvansam Is it solved?
>
> Yes, it works for me.

Can you help elaborate on your fix?

geekbeast commented 1 year ago

tl;dr: sudo docker run --privileged --gpus all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/host/path/to/models:/models nvcr.io/nvidia/tritonserver:23.07-py3 tritonserver --model-repository=/models

Adding the --privileged flag allows the failing mmap operation to complete.

Troubleshooting journey:

I used a python script that allowed me to invoke cuInit directly from python. This narrowed the error down to: RuntimeError: cuInit failed with error code 304: OS call failed or operation not supported on this OS

For some reason the server is unable to pin memory, which triggers the OS call failed error. Apparently, this is occasionally solved by reboot, but I've had no such luck so I did some more digging on the particular failure coming from mmap (operation not permitted), which led me to https://stackoverflow.com/questions/8213671/mmap-operation-not-permitted.

I checked my kernel configuration and had CONFIG_STRICT_DEVMEM=y, but I also wasn't seeing the same errors outside of docker, which made me think it was related to the privileges docker was running with. I tried running with docker run --privileged and it worked!

This is a fine solution for development, but given what --privileged allows, it is a non-starter for production (not sure if the NVIDIA folks expect that as a use case).
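For anyone who wants to reproduce the cuInit probe described above, a few lines of ctypes are enough. The exact script used isn't shown in this thread, so the following is a hedged sketch: it loads libcuda.so.1 directly and reports the raw CUDA result code (304 corresponds to CUDA_ERROR_OPERATING_SYSTEM, the error in this issue):

```python
import ctypes

def probe_cuda_init():
    """Call cuInit(0) from the CUDA driver library.

    Returns 0 on success, a nonzero CUDA error code on failure,
    or None if libcuda.so.1 cannot be loaded at all.
    """
    try:
        cuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return None  # driver library not installed / not mounted into the container
    return cuda.cuInit(0)

if __name__ == "__main__":
    code = probe_cuda_init()
    if code is None:
        print("libcuda.so.1 not found - is the NVIDIA driver visible here?")
    elif code == 0:
        print("cuInit succeeded")
    else:
        # e.g. 304 = CUDA_ERROR_OPERATING_SYSTEM, as seen in this issue
        print(f"cuInit failed with error code {code}")
```

Running this inside the container with and without --privileged is a quick way to confirm whether the failure really is the mmap permission issue rather than a missing device.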

SamuraiBUPT commented 1 year ago

> I also encountered the same problem, the cause of this problem is because my machine suddenly shut down. My solution was to first stop the docker container, then restart the machine, and finally create a new docker container. It works for me.

Same, I solved this problem by restarting the machine: sudo reboot.

It's said that 99% of problems can be solved by restarting the machine.

:)

dyastremsky commented 1 year ago

Thank you all for your solutions and contributions! Closing due to inactivity.

kaustubhcs commented 8 months ago

I got the same issue; adding --privileged to my docker command solved it. Final docker command:

docker run --privileged --gpus all -it --rm \
    --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /home/llama.cpp:/root/git/llama.cpp \
    nvcr.io/nvidia/pytorch:24.01-py3

iibw commented 8 months ago

Same here. I hope a workaround can be found before entering production.

sudo docker run --privileged --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash

kartikpodugu commented 4 months ago

> Hi @zengqi0730, could you add --gpus all to the docker run command and check if the error still occurs?

My docker command already has --gpus all, and I still see the problem. Please suggest what to do.

kartikpodugu commented 4 months ago

> Same issue for 23.06.

Same issue for 24.04. Please help.

Grisly00 commented 3 months ago

I have the same problem running docker rootless. Users do not have the right to use the --ulimit flag. This is my command:

docker run --rm --gpus all --shm-size=32gb --privileged nvcr.io/nvidia/pytorch:23.05-py3 nvidia-smi

which gets me:

NVIDIA Release 23.05 (build 60708168)
PyTorch Version 2.0.0
...
ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ Unknown error (error 999) ]]

Any ideas?

MehdiTantaoui-99 commented 3 weeks ago

Hello all,

I am facing the same issue; I tried rebooting and the --privileged option, but I get the same error:

docker run --privileged --gpus all -it --rm \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
nvcr.io/nvidia/tritonserver:24.09-py3   tritonserver --model-repository=/models
NVIDIA Release 24.09 (build 112408254)
Triton Server Version 2.50.0

Copyright (c) 2018-2024, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
   [[ No CUDA-capable device is detected (error 100) ]]

This is my driver:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A30                     Off |   00000000:0A:00.0 Off |                   On |
| N/A   29C    P0             31W /  165W |       1MiB /  24576MiB |     N/A      Default |
|                                         |                        |              Enabled |
+-----------------------------------------+------------------------+----------------------+
zyle0 commented 1 week ago

> I am facing the same issue, I tried the reboot and --privileged options but getting the same error:
>
> ERROR: The NVIDIA Driver is present, but CUDA failed to initialize.  GPU functionality will not be available.
>    [[ No CUDA-capable device is detected (error 100) ]]

Similar issue, but the error is:

ERROR: The NVIDIA Driver is present, but CUDA failed to initialize. GPU functionality will not be available. [[ Named symbol not found (error 500) ]]