triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Loading ONNX model fails because of insufficient CUDA driver version #4346

Closed janjagusch closed 2 years ago

janjagusch commented 2 years ago

Description

I'm trying to load the densenet_onnx example into Triton v22.04.

Upon startup, I get the following error message:

# Truncated for readability, full traceback below ...
I0506 10:50:30.170242 1 server.cc:619]
+---------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Model         | Version | Status                                                                                                                                            |
+---------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| densenet_onnx | 1       | UNAVAILABLE: Internal: onnx runtime error 1: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(E |
|               |         | RRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /workspace/onnxruntime/onnxruntime/core/prov |
|               |         | iders/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool T |
|               |         | HRW = true] CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=32539 ; hostname=3d35a7f61dd1 ; expr=cudaSetDevic |
|               |         | e(info_.device_id);                                                                                                                               |
|               |         |                                                                                                                                                   |
+---------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------+

I don't want to run my model on the GPU, so I believe I shouldn't need CUDA drivers in the first place. Is it possible to use an ONNX Runtime setup that only uses the CPU and therefore doesn't call CUDA at all?

Triton Information

I'm using nvcr.io/nvidia/tritonserver:22.04-py3.

To Reproduce

In an empty directory:

mkdir -p model_repository/densenet_onnx/1
wget -O model_repository/densenet_onnx/1/model.onnx \
     https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx
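
After these two commands the directory should match the repository layout Triton expects, roughly:

model_repository/
└── densenet_onnx/
    └── 1/
        └── model.onnx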

docker run \
    --interactive \
    --tty \
    --shm-size 1500000000 \
    --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models \
    nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver \
        --model-repository=/models \
        --strict-model-config=false
Full traceback:

```
=============================
== Triton Inference Server ==
=============================

NVIDIA Release 22.04 (build 36821869)

Triton Server Version 2.21.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

WARNING: [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
I0506 10:59:05.291016 1 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0506 10:59:05.291274 1 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0506 10:59:05.291290 1 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2022-05-06 10:59:05.457743: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0506 10:59:05.501305 1 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0506 10:59:05.501395 1 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0506 10:59:05.501417 1 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0506 10:59:05.501436 1 tensorflow.cc:2221] backend configuration: {}
I0506 10:59:05.503125 1 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0506 10:59:05.503202 1 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0506 10:59:05.503231 1 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0506 10:59:05.503252 1 onnxruntime.cc:2446] backend configuration: {}
I0506 10:59:05.518030 1 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0506 10:59:05.518106 1 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0506 10:59:05.518125 1 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
W0506 10:59:05.518162 1 pinned_memory_manager.cc:236] Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
I0506 10:59:05.518201 1 cuda_memory_manager.cc:115] CUDA memory pool disabled
I0506 10:59:05.529323 1 model_repository_manager.cc:1077] loading: densenet_onnx:1
I0506 10:59:05.637889 1 onnxruntime.cc:2481] TRITONBACKEND_ModelInitialize: densenet_onnx (version 1)
I0506 10:59:05.932762 1 onnxruntime.cc:2504] TRITONBACKEND_ModelFinalize: delete model state
E0506 10:59:05.932885 1 model_repository_manager.cc:1234] failed to load 'densenet_onnx' version 1: Internal: onnx runtime error 1: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=32553 ; hostname=8518a28ba867 ; expr=cudaSetDevice(info_.device_id);
I0506 10:59:05.933914 1 server.cc:549]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0506 10:59:05.933966 1 server.cc:576]
+-------------+--------------------------------------------------------------------------+--------+
| Backend     | Path                                                                     | Config |
+-------------+--------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
+-------------+--------------------------------------------------------------------------+--------+

I0506 10:59:05.934006 1 server.cc:619]
+---------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| Model         | Version | Status                                                                                                                                            |
+---------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------+
| densenet_onnx | 1       | UNAVAILABLE: Internal: onnx runtime error 1: /workspace/onnxruntime/onnxruntime/core/providers/cuda/cuda_call.cc:122 bool onnxruntime::CudaCall(E |
|               |         | RRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool THRW = true] /workspace/onnxruntime/onnxruntime/core/prov |
|               |         | iders/cuda/cuda_call.cc:116 bool onnxruntime::CudaCall(ERRTYPE, const char*, const char*, ERRTYPE, const char*) [with ERRTYPE = cudaError; bool T |
|               |         | HRW = true] CUDA failure 35: CUDA driver version is insufficient for CUDA runtime version ; GPU=32553 ; hostname=8518a28ba867 ; expr=cudaSetDevic |
|               |         | e(info_.device_id);                                                                                                                               |
|               |         |                                                                                                                                                   |
+---------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------+

I0506 10:59:05.934159 1 tritonserver.cc:2123]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                      |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                     |
| server_version                   | 2.21.0                                                                                                                                     |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda |
|                                  | _shared_memory binary_tensor_data statistics trace                                                                                        |
| model_repository_path[0]         | /models                                                                                                                                    |
| model_control_mode               | MODE_NONE                                                                                                                                  |
| strict_model_config              | 0                                                                                                                                          |
| rate_limit                       | OFF                                                                                                                                        |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                  |
| response_cache_byte_size         | 0                                                                                                                                          |
| min_supported_compute_capability | 6.0                                                                                                                                        |
| strict_readiness                 | 1                                                                                                                                          |
| exit_timeout                     | 30                                                                                                                                         |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+

I0506 10:59:05.934192 1 server.cc:250] Waiting for in-flight requests to complete.
I0506 10:59:05.934201 1 server.cc:266] Timeout 30: Found 0 model versions that have in-flight inferences
I0506 10:59:05.934208 1 server.cc:281] All models are stopped, unloading models
I0506 10:59:05.934218 1 server.cc:288] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
```

I'm running this on a MacBook Pro with macOS 12.2, a 2.3 GHz Quad-Core Intel Core i7, and Intel Iris Plus Graphics (1536 MB).

Expected behavior

I expect Triton to load the ONNX model.

rmccorm4 commented 2 years ago

Hi @janjagusch ,

Can you try to force CPU as described here by adding something like the following to your densenet_onnx config.pbtxt:

  instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
  ]
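
For context, a minimal complete config.pbtxt for this model could look like the sketch below; the name and platform values shown here are the usual ones for the densenet_onnx example and can typically be omitted when --strict-model-config=false auto-completes them:

  name: "densenet_onnx"
  platform: "onnxruntime_onnx"
  instance_group [
    {
      count: 1
      kind: KIND_CPU
    }
  ]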
janjagusch commented 2 years ago

> Hi @janjagusch ,
>
> Can you try to force CPU as described here by adding something like the following to your densenet_onnx config.pbtxt:
>
>   instance_group [
>     {
>       count: 1
>       kind: KIND_CPU
>     }
>   ]

Thanks for helping out, @rmccorm4. Unfortunately, even after applying your suggestion, the same error remains. You can reproduce it yourself this way:

mkdir -p model_repository/densenet_onnx/1
wget -O model_repository/densenet_onnx/1/model.onnx \
     https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx
cat > model_repository/densenet_onnx/config.pbtxt <<- EOM
instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]
EOM

docker run \
    --interactive \
    --tty \
    --shm-size 1500000000 \
    --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models \
    nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver \
        --model-repository=/models \
        --strict-model-config=false
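
For what it's worth, one way to confirm the instance_group is actually being picked up (a suggestion on my part, not something covered above) is to rerun the same command with verbose logging enabled, which makes Triton print the completed model configuration while it loads the model:

docker run \
    --interactive \
    --tty \
    --shm-size 1500000000 \
    --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models \
    nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver \
        --model-repository=/models \
        --strict-model-config=false \
        --log-verbose=1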
tanmayv25 commented 2 years ago

@janjagusch This is a known issue. We will have the fix soon.

CC: @nv-kmcgill53

nv-kmcgill53 commented 2 years ago

This fix will be in the 22.05 release.
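
A sketch of how that could be verified once the 22.05 image is published (the image tag below follows the standard NGC naming, and the readiness check uses the standard KServe v2 HTTP endpoint; both are assumptions about the fixed release, not steps confirmed in this thread):

docker run \
    --interactive \
    --tty \
    --shm-size 1500000000 \
    --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models \
    nvcr.io/nvidia/tritonserver:22.05-py3 tritonserver \
        --model-repository=/models \
        --strict-model-config=false

# In another shell, once the server is up, a successful CPU-only load
# should make the model readiness endpoint return HTTP 200:
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/densenet_onnx/ready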