triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Failed to stat file model.onnx while using conda-pack in configs #7531

Open Spectra456 opened 3 months ago

Spectra456 commented 3 months ago

Description: Hi, I'm trying to use conda-packs inside my config.pbtxt files for Python-backend models. When I add them, the Python models load successfully, but the model with backend: "onnxruntime" reports the following error:

E0815 11:26:46.113871 1624 model_lifecycle.cc:626] failed to load 'embedder' version 1: Internal: failed to stat file server/embedder/1/model.onnx

Without conda-packs in the config.pbtxt files, Triton Server and the conda environment work without any problems. I tried adding the conda environment to the ONNX Runtime model configuration as well, but it didn't change anything. There are no issues with the path to the model. I also tried using the default_model_filename option, but it didn't help.

Triton Information: I'm currently using 23.07; I also tried 24.07, and it didn't change anything.

Are you using the Triton container or did you build it yourself? The default container, plus a conda environment with the libraries for the Python-backend models.
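
For context, a conda-pack is attached to a Python-backend model through the EXECUTION_ENV_PATH parameter in its config.pbtxt. A minimal sketch of what that presumably looks like here, using the pack path that appears in the logs below (the actual Python-backend configs are not included in this report):

parameters: {
  key: "EXECUTION_ENV_PATH"
  # path taken from the "Using Python execution env" log lines below;
  # the real Python-backend configs are not shown in this issue
  value: { string_value: "/ali/diarization/diarization_upgrade/diarization-sdk/diarization-triton.tar.gz" }
}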

To Reproduce: Here is my config:

name: "embedder"
backend: "onnxruntime"

version_policy {
  specific {
    versions: 1
  }
}

max_batch_size: 256
input [
  {
    name: "waveform"
    data_type: TYPE_FP32
    dims: [ -1 ] # num_mel_bins
  }
]

output [
  {
    name: "embs"
    data_type: TYPE_FP32
    dims: [ 256 ] # [embedding_size]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64, 128, 256 ]
}

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
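
For context, the "failed to stat file server/embedder/1/model.onnx" error refers to the path Triton derives for the onnxruntime model (no default_model_filename is set in the config above, so it falls back to model.onnx). Assuming the repository root is the "server" directory shown in the logs, the layout the server is looking for would be roughly:

server/
  embedder/
    config.pbtxt
    1/
      model.onnx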

Here are the logs:

I0815 11:48:13.401738 1872 libtorch.cc:2507] TRITONBACKEND_Initialize: pytorch
I0815 11:48:13.401803 1872 libtorch.cc:2517] Triton TRITONBACKEND API version: 1.13
I0815 11:48:13.401825 1872 libtorch.cc:2523] 'pytorch' TRITONBACKEND API version: 1.13
I0815 11:48:14.912483 1872 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f4f9c000000' with size 268435456
I0815 11:48:14.913378 1872 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0815 11:48:14.918040 1872 model_lifecycle.cc:462] loading: vad:1
I0815 11:48:14.918167 1872 model_lifecycle.cc:462] loading: clusterer:1
I0815 11:48:14.918301 1872 model_lifecycle.cc:462] loading: embedder:1
I0815 11:48:14.918438 1872 model_lifecycle.cc:462] loading: diarization:1
I0815 11:48:14.927406 1872 onnxruntime.cc:2514] TRITONBACKEND_Initialize: onnxruntime
I0815 11:48:14.927524 1872 onnxruntime.cc:2524] Triton TRITONBACKEND API version: 1.13
I0815 11:48:14.927599 1872 onnxruntime.cc:2530] 'onnxruntime' TRITONBACKEND API version: 1.13
I0815 11:48:14.927673 1872 onnxruntime.cc:2560] backend configuration:
{"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}
I0815 11:48:14.928461 1872 python_be.cc:1746] Using Python execution env /ali/diarization/diarization_upgrade/diarization-sdk/diarization-triton.tar.gz
I0815 11:48:14.928848 1872 python_be.cc:1746] Using Python execution env /ali/diarization/diarization_upgrade/diarization-sdk/diarization-triton.tar.gz
I0815 11:48:14.979405 1872 onnxruntime.cc:2625] TRITONBACKEND_ModelInitialize: embedder (version 1)
I0815 11:48:14.980237 1872 onnxruntime.cc:692] skipping model configuration auto-complete for 'embedder': inputs and outputs already specified
I0815 11:48:14.981187 1872 python_be.cc:1746] Using Python execution env /ali/diarization/diarization_upgrade/diarization-sdk/diarization-triton.tar.gz
I0815 11:48:14.981270 1872 onnxruntime.cc:2686] TRITONBACKEND_ModelInstanceInitialize: embedder_0 (GPU device 0)
I0815 11:48:14.982179 1872 onnxruntime.cc:2738] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0815 11:48:14.982238 1872 onnxruntime.cc:2666] TRITONBACKEND_ModelFinalize: delete model state
E0815 11:48:14.982289 1872 model_lifecycle.cc:626] failed to load 'embedder' version 1: Internal: failed to stat file server/embedder/1/model.onnx
I0815 11:48:14.982334 1872 model_lifecycle.cc:755] failed to load 'embedder'
I0815 11:48:53.389924 1872 python_be.cc:2108] TRITONBACKEND_ModelInstanceInitialize: diarization_0 (GPU device 0)
I0815 11:48:53.404604 1872 python_be.cc:2108] TRITONBACKEND_ModelInstanceInitialize: vad_0 (GPU device 0)
I0815 11:48:53.807442 1872 python_be.cc:2108] TRITONBACKEND_ModelInstanceInitialize: clusterer (GPU device 0)
I0815 11:48:55.223670 1872 model_lifecycle.cc:817] successfully loaded 'diarization'
I0815 11:48:55.296275 1872 model_lifecycle.cc:817] successfully loaded 'vad'
I0815 11:48:56.105169 1872 model_lifecycle.cc:817] successfully loaded 'clusterer'
I0815 11:48:56.105613 1872 server.cc:604] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0815 11:48:56.106054 1872 server.cc:631] 
+-------------+----------------------------------------------------+----------------------------------------------------+
| Backend     | Path                                               | Config                                             |
+-------------+----------------------------------------------------+----------------------------------------------------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytor | {}                                                 |
|             | ch.so                                              |                                                    |
| python      | /opt/tritonserver/backends/python/libtriton_python | {"cmdline":{"auto-complete-config":"true","backend |
|             | .so                                                | -directory":"/opt/tritonserver/backends","min-comp |
|             |                                                    | ute-capability":"6.000000","default-max-batch-size |
|             |                                                    | ":"4"}}                                            |
|             |                                                    |                                                    |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_o | {"cmdline":{"auto-complete-config":"true","backend |
|             | nnxruntime.so                                      | -directory":"/opt/tritonserver/backends","min-comp |
|             |                                                    | ute-capability":"6.000000","default-max-batch-size |
|             |                                                    | ":"4"}}                                            |
|             |                                                    |                                                    |
+-------------+----------------------------------------------------+----------------------------------------------------+

I0815 11:48:56.106480 1872 server.cc:674] 
+-------------+---------+-------------------------------------------------------------------------+
| Model       | Version | Status                                                                  |
+-------------+---------+-------------------------------------------------------------------------+
| clusterer   | 1       | READY                                                                   |
| diarization | 1       | READY                                                                   |
| embedder    | 1       | UNAVAILABLE: Internal: failed to stat file server/embedder/1/model.onnx |
| vad         | 1       | READY                                                                   |
+-------------+---------+-------------------------------------------------------------------------+

I0815 11:48:56.129178 1872 metrics.cc:810] Collecting metrics for GPU 0: Tesla V100-PCIE-16GB
I0815 11:48:56.130549 1872 metrics.cc:703] Collecting CPU metrics
I0815 11:48:56.130875 1872 tritonserver.cc:2415] 
+----------------------------------+------------------------------------------------------------------------------------+
| Option                           | Value                                                                              |
+----------------------------------+------------------------------------------------------------------------------------+
| server_id                        | triton                                                                             |
| server_version                   | 2.36.0                                                                             |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) sched |
|                                  | ule_policy model_configuration system_shared_memory cuda_shared_memory binary_tens |
|                                  | or_data parameters statistics trace logging                                        |
| model_repository_path[0]         | server                                                                             |
| model_control_mode               | MODE_NONE                                                                          |
| strict_model_config              | 0                                                                                  |
| rate_limit                       | OFF                                                                                |
| pinned_memory_pool_byte_size     | 268435456                                                                          |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                           |
| min_supported_compute_capability | 6.0                                                                                |
| strict_readiness                 | 1                                                                                  |
| exit_timeout                     | 30                                                                                 |
| cache_enabled                    | 0                                                                                  |
+----------------------------------+------------------------------------------------------------------------------------+

I0815 11:48:56.130934 1872 server.cc:305] Waiting for in-flight requests to complete.
I0815 11:48:56.130973 1872 server.cc:321] Timeout 30: Found 0 model versions that have in-flight inferences
I0815 11:48:56.131290 1872 server.cc:336] All models are stopped, unloading models
I0815 11:48:56.131360 1872 server.cc:343] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
I0815 11:48:57.131507 1872 server.cc:343] Timeout 29: Found 3 live models and 0 in-flight non-inference requests
I0815 11:48:57.729374 1872 model_lifecycle.cc:608] successfully unloaded 'vad' version 1
I0815 11:48:57.730000 1872 model_lifecycle.cc:608] successfully unloaded 'diarization' version 1
I0815 11:48:57.867292 1872 model_lifecycle.cc:608] successfully unloaded 'clusterer' version 1
I0815 11:48:58.131692 1872 server.cc:343] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
  

Expected behavior: Using conda-packs in the Python backend shouldn't affect ONNX Runtime models.

Spectra456 commented 3 months ago

Update: If I switch from GPU to CPU inference for onnxruntime, it works.

Update 2: I converted the model from ONNX to TensorRT, and the error is still the same.
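
The CPU workaround mentioned above corresponds roughly to changing the instance_group in the config, e.g. (a sketch, not necessarily the exact config used):

instance_group [
  {
    count: 1
    kind: KIND_CPU    # CPU inference works here; KIND_GPU triggers the stat error above
  }
]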