triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Config issue whilst spinning up the server with Falcon #86

Open harryjulian opened 11 months ago

harryjulian commented 11 months ago

I've followed a mixture of the tutorial for building Falcon here and the guide for spinning up the Triton Inference Server here.

I'm currently getting some odd errors when trying to launch the server from within the nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 container.

Launch command:

tritonserver --model-repository=/opt/tritonserver/inflight_batcher_llm
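
For context, the model repository uses the standard inflight_batcher_llm layout from this repo; these four models are the ones that appear in the server log below:

inflight_batcher_llm/
  ensemble/
  preprocessing/
  postprocessing/
  tensorrt_llm/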

Error:

 Invalid argument: ensemble 'ensemble' depends on 'postprocessing' which has no loaded version. Model 'postprocessing' loading failed with error: version 1 is at UNAVAILABLE state: Internal: KeyError: <class 'transformers.models.falcon.configuration_falcon.FalconConfig'>

At:
  /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py(674): __getitem__
  /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(718): from_pretrained
  /opt/tritonserver/inflight_batcher_llm/postprocessing/1/model.py(65): initialize
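The failing frame in both models is the tokenizer load inside initialize() (model.py line 65 in postprocessing, line 69 in preprocessing, per the tracebacks). A minimal way to reproduce the call outside Triton, from inside the same container; the path below is a placeholder for whatever tokenizer_dir is set to in the preprocessing/postprocessing config.pbtxt:

# Standalone reproduction of the failing call, run with the container's Python.
# The path is illustrative; substitute your actual tokenizer_dir.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/path/to/falcon/tokenizer")
print(type(tokenizer))

Note that the engine itself loads fine (tensorrt_llm shows READY below); only the two tokenizer-dependent Python models fail. Full server log: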
I1102 13:03:40.144460 2036 server.cc:592] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I1102 13:03:40.144514 2036 server.cc:619] 
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                                                                                        |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1102 13:03:40.144605 2036 server.cc:662] 
+----------------+---------+---------------------------------------------------------------------------------------------------------------+
| Model          | Version | Status                                                                                                        |
+----------------+---------+---------------------------------------------------------------------------------------------------------------+
| postprocessing | 1       | UNAVAILABLE: Internal: KeyError: <class 'transformers.models.falcon.configuration_falcon.FalconConfig'>       |
|                |         |                                                                                                               |
|                |         | At:                                                                                                           |
|                |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py(674): __getitem__          |
|                |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(718): from_pretrained |
|                |         |   /opt/tritonserver/inflight_batcher_llm/postprocessing/1/model.py(65): initialize                            |
| preprocessing  | 1       | UNAVAILABLE: Internal: KeyError: <class 'transformers.models.falcon.configuration_falcon.FalconConfig'>       |
|                |         |                                                                                                               |
|                |         | At:                                                                                                           |
|                |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py(674): __getitem__          |
|                |         |   /usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py(718): from_pretrained |
|                |         |   /opt/tritonserver/inflight_batcher_llm/preprocessing/1/model.py(69): initialize                             |
| tensorrt_llm   | 1       | READY                                                                                                         |
+----------------+---------+---------------------------------------------------------------------------------------------------------------+

I1102 13:03:40.199393 2036 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A10G
I1102 13:03:40.199694 2036 metrics.cc:710] Collecting CPU metrics
I1102 13:03:40.199888 2036 tritonserver.cc:2458] 
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                           |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                                          |
| server_version                   | 2.39.0                                                                                                                                                                                                          |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /opt/tritonserver/inflight_batcher_llm                                                                                                                                                                          |
| model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
| strict_model_config              | 0                                                                                                                                                                                                               |
| rate_limit                       | OFF                                                                                                                                                                                                             |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
| strict_readiness                 | 1                                                                                                                                                                                                               |
| exit_timeout                     | 30                                                                                                                                                                                                              |
| cache_enabled                    | 0                                                                                                                                                                                                               |
+----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I1102 13:03:40.199943 2036 server.cc:293] Waiting for in-flight requests to complete.
I1102 13:03:40.199953 2036 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I1102 13:03:40.200028 2036 server.cc:324] All models are stopped, unloading models
I1102 13:03:40.200041 2036 server.cc:331] Timeout 30: Found 1 live models and 0 in-flight non-inference requests
I1102 13:03:40.301918 2036 model_lifecycle.cc:603] successfully unloaded 'tensorrt_llm' version 1
I1102 13:03:41.200144 2036 server.cc:331] Timeout 29: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
byshiue commented 11 months ago

What version of transformers are you using? Can you try installing transformers==4.33.1? It looks like a bug in transformers, since it fails while loading the tokenizer.
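
To confirm which version the backend actually picks up, you can check from the same interpreter that runs the Python models (a quick sanity check, nothing backend-specific):

# Run inside the tritonserver container, with the same Python that
# executes the preprocessing/postprocessing models.
import transformers
print(transformers.__version__)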

harryjulian commented 11 months ago

Tried it in a fresh container; unfortunately it doesn't fix anything. I'm getting the same error. Let me know if there's more detail I can provide.

byshiue commented 11 months ago

Which version of transformers are you using now?

harryjulian commented 11 months ago

I was originally using 4.34.1, and also tried the version you suggested.