k2-fsa / sherpa

Speech-to-text server framework with next-gen Kaldi
https://k2-fsa.github.io/sherpa
Apache License 2.0

unable to launch Triton server on finetuned whisper model #568

Open StephennFernandes opened 2 months ago

StephennFernandes commented 2 months ago

Hi there, I have been finetuning Whisper models using Hugging Face. To convert the model to TensorRT-LLM format, I use an HF script that converts the model from its HF format to the original OpenAI format; I then follow your instructions and convert the OpenAI model to TensorRT-LLM format, which succeeds.
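
For reference, a minimal sanity check on the intermediate checkpoint, assuming the usual OpenAI Whisper checkpoint layout (a dict with "dims" and "model_state_dict" keys) and a placeholder file path:

import torch

# Placeholder path to the checkpoint produced by the HF -> OpenAI conversion step.
ckpt = torch.load("whisper-finetuned-openai.pt", map_location="cpu")

# OpenAI-format Whisper checkpoints store the hyperparameters under "dims"
# and the weights under "model_state_dict".
assert "dims" in ckpt and "model_state_dict" in ckpt, "not an OpenAI-format checkpoint"
print(ckpt["dims"])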

However, when I follow the subsequent steps to launch the Triton inference server using the launch_server.sh script, I get the following error:

I0408 20:10:02.904273 254 server.cc:345] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

The following is the full log after launching the bash script, including the stack trace:

I0408 20:09:55.322866 254 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7c5bd2000000' with size 2048000000
I0408 20:09:55.324786 254 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 4096000000
I0408 20:09:55.330786 254 model_lifecycle.cc:461] loading: whisper:1
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022700
I0408 20:09:58.959624 254 python_be.cc:2362] TRITONBACKEND_ModelInstanceInitialize: whisper_0_0 (CPU device 0)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024022700
[04/08/2024-20:10:01] [TRT] [E] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.
[04/08/2024-20:10:01] [TRT] [E] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
I0408 20:10:02.021843 254 pb_stub.cc:346] Failed to initialize Python stub: AttributeError: 'NoneType' object has no attribute 'create_execution_context'

At:
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(67): _init
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_serialized_engine
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(53): get_session
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(201): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(52): init_model
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(45): initialize

E0408 20:10:02.773438 254 backend_model.cc:691] ERROR: Failed to create instance: AttributeError: 'NoneType' object has no attribute 'create_execution_context'

At:
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(67): _init
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_serialized_engine
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(53): get_session
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(201): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(52): init_model
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(45): initialize

E0408 20:10:02.773573 254 model_lifecycle.cc:630] failed to load 'whisper' version 1: Internal: AttributeError: 'NoneType' object has no attribute 'create_execution_context'

At:
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(67): _init
  /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_serialized_engine
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(53): get_session
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(201): __init__
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(52): init_model
  /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/model.py(45): initialize

I0408 20:10:02.773614 254 model_lifecycle.cc:765] failed to load 'whisper'
I0408 20:10:02.773752 254 server.cc:606] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0408 20:10:02.773824 254 server.cc:633] 
+---------+---------------------------------------------------+---------------------------------------------------+
| Backend | Path                                              | Config                                            |
+---------+---------------------------------------------------+---------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_pytho | {"cmdline":{"auto-complete-config":"true","backen |
|         | n.so                                              | d-directory":"/opt/tritonserver/backends","min-co |
|         |                                                   | mpute-capability":"6.000000","default-max-batch-s |
|         |                                                   | ize":"4"}}                                        |
|         |                                                   |                                                   |
+---------+---------------------------------------------------+---------------------------------------------------+

I0408 20:10:02.773886 254 server.cc:676] 
+---------+---------+---------------------------------------------------------------------------------------------+
| Model   | Version | Status                                                                                      |
+---------+---------+---------------------------------------------------------------------------------------------+
| whisper | 1       | UNAVAILABLE: Internal: AttributeError: 'NoneType' object has no attribute 'create_execution |
|         |         | _context'                                                                                   |
|         |         |                                                                                             |
|         |         | At:                                                                                         |
|         |         |   /usr/local/lib/python3.10/dist-packages/tensorrt_llm/runtime/session.py(80): from_seriali |
|         |         | zed_engine                                                                                  |
|         |         |   /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtll |
|         |         | m/whisper/1/whisper_trtllm.py(33): __init__                                                 |
|         |         |   /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtll |
|         |         | m/whisper/1/model.py(52): init_model                                                        |
|         |         |   /workspace/TensorRT-LLM/examples/whisper/sherpa/triton/whisper/./model_repo_whisper_trtllm/whisper/1/whisper_trtllm.py(33): __init__ |
+---------+---------+---------------------------------------------------------------------------------------------+

I0408 20:10:02.888804 254 metrics.cc:877] Collecting metrics for GPU 0: NVIDIA RTX A6000
I0408 20:10:02.903932 254 metrics.cc:770] Collecting CPU metrics
I0408 20:10:02.904225 254 tritonserver.cc:2498] 
+----------------------------------+------------------------------------------------------------------------------+
| Option                           | Value                                                                        |
+----------------------------------+------------------------------------------------------------------------------+
| server_id                        | triton                                                                       |
| server_version                   | 2.42.0                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) |
|                                  |  schedule_policy model_configuration system_shared_memory cuda_shared_memory |
|                                  |  binary_tensor_data parameters statistics trace logging                      |
| model_repository_path[0]         | ./model_repo_whisper_trtllm                                                  |
| model_control_mode               | MODE_NONE                                                                    |
| strict_model_config              | 0                                                                            |
| rate_limit                       | OFF                                                                          |
| pinned_memory_pool_byte_size     | 2048000000                                                                   |
| cuda_memory_pool_byte_size{0}    | 4096000000                                                                   |
| min_supported_compute_capability | 6.0                                                                          |
| strict_readiness                 | 1                                                                            |
| exit_timeout                     | 30                                                                           |
| cache_enabled                    | 0                                                                            |
+----------------------------------+------------------------------------------------------------------------------+

I0408 20:10:02.904255 254 server.cc:307] Waiting for in-flight requests to complete.
I0408 20:10:02.904262 254 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0408 20:10:02.904268 254 server.cc:338] All models are stopped, unloading models
I0408 20:10:02.904273 254 server.cc:345] Timeout 30: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
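
For context on the AttributeError above: TensorRT's deserialize_cuda_engine() returns None instead of raising when the plan file was built with a different TensorRT version, and tensorrt_llm's session.py then calls create_execution_context() on that None. A minimal sketch of the failing path, with a placeholder engine filename:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("encoder.engine", "rb") as f:  # placeholder engine path
    engine = runtime.deserialize_cuda_engine(f.read())

# On a TensorRT version mismatch, deserialize_cuda_engine() returns None,
# which is what produces the 'NoneType' ... 'create_execution_context' error.
if engine is None:
    raise RuntimeError("Engine deserialization failed; rebuild with the runtime's TensorRT version.")
context = engine.create_execution_context()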
csukuangfj commented 2 months ago

@yuekaizhang Could you have a look?

yuekaizhang commented 2 months ago

The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.

@StephennFernandes It seems you built the engines and are running them in different environments. Would you mind building and running in the same Docker container, e.g. soar97/triton-whisper:24.01.complete?
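
A quick way to confirm the two environments match is to print the library versions in both the build environment and the serving container, for example:

import tensorrt as trt
import tensorrt_llm

# The engine plan embeds the TensorRT version it was built with; the serving
# container must ship the same version, or deserialization fails as above.
print("TensorRT:", trt.__version__)
print("TensorRT-LLM:", tensorrt_llm.__version__)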

StephennFernandes commented 2 months ago

@yuekaizhang I got it working, thanks a ton for your assistance. Also, I noticed that we cannot do inference on longer audio files, beyond 30s.

yuekaizhang commented 2 months ago

@yuekaizhang I got it working, thanks a ton for your assistance. Also, I noticed that we cannot do inference on longer audio files, beyond 30s.

@StephennFernandes Since Whisper can only process audio shorter than 30s, you need to implement a VAD segmenter, as in this project: https://github.com/shashikg/WhisperS2T/tree/main. Welcome to contribute :D
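
A rough sketch of such a segmenter, using Silero VAD via torch.hub to find speech regions and greedily merging them into chunks under 30s; transcribe_chunk() is a hypothetical helper standing in for a request to the Triton whisper endpoint:

import torch

# Load Silero VAD (downloads the model on first use).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SR = 16000
MAX_LEN = 30 * SR  # Whisper's 30-second window, in samples

wav = read_audio("long_audio.wav", sampling_rate=SR)  # placeholder input file
speech = get_speech_timestamps(wav, model, sampling_rate=SR)

# Greedily merge adjacent speech segments into chunks that stay under 30 s
# (assumes no single VAD segment itself exceeds 30 s).
chunks, cur_start, cur_end = [], None, None
for seg in speech:
    if cur_start is None:
        cur_start, cur_end = seg["start"], seg["end"]
    elif seg["end"] - cur_start <= MAX_LEN:
        cur_end = seg["end"]
    else:
        chunks.append(wav[cur_start:cur_end])
        cur_start, cur_end = seg["start"], seg["end"]
if cur_start is not None:
    chunks.append(wav[cur_start:cur_end])

for chunk in chunks:
    print(transcribe_chunk(chunk))  # hypothetical call to the Triton server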

StephennFernandes commented 2 months ago

@yuekaizhang Thanks for the heads up. Already on it.

StephennFernandes commented 2 months ago

@yuekaizhang Hey, it's not that this error has broken the deployment; as far as I can see, my Triton deployment works fine. But this weird error log pops up when I deploy my model. Do you happen to know what it means?

I0412 08:38:16.910492 1416 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x76d5da000000' with size 2048000000
I0412 08:38:16.911524 1416 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 4096000000
I0412 08:38:16.915451 1416 model_lifecycle.cc:469] loading: whisper:1
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
free(): invalid pointer
[user-DSA7TGX-424R:01427] *** Process received signal ***
[user-DSA7TGX-424R:01427] Signal: Aborted (6)
[user-DSA7TGX-424R:01427] Signal code:  (-6)
[user-DSA7TGX-424R:01427] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7b8114a16520]
[user-DSA7TGX-424R:01427] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7b8114a6a9fc]
[user-DSA7TGX-424R:01427] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7b8114a16476]
[user-DSA7TGX-424R:01427] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7b81149fc7f3]
[user-DSA7TGX-424R:01427] [ 4] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x89676)[0x7b8114a5d676]
[user-DSA7TGX-424R:01427] [ 5] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa0cfc)[0x7b8114a74cfc]
[user-DSA7TGX-424R:01427] [ 6] /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa2a44)[0x7b8114a76a44]
[user-DSA7TGX-424R:01427] [ 7] /usr/lib/x86_64-linux-gnu/libc.so.6(free+0x73)[0x7b8114a79453]
[user-DSA7TGX-424R:01427] [ 8] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x70064)[0x64e9f6599064]
[user-DSA7TGX-424R:01427] [ 9] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x25e13)[0x64e9f654ee13]
[user-DSA7TGX-424R:01427] [10] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7b81149fdd90]
[user-DSA7TGX-424R:01427] [11] /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7b81149fde40]
[user-DSA7TGX-424R:01427] [12] /opt/tritonserver/backends/python/triton_python_backend_stub(+0x26b75)[0x64e9f654fb75]
[user-DSA7TGX-424R:01427] *** End of error message ***
I0412 08:38:22.776815 1416 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: whisper_0_0 (CPU device 0)
[TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024040200
I0412 08:38:30.125002 1416 model_lifecycle.cc:835] successfully loaded 'whisper'
I0412 08:38:30.125259 1416 server.cc:607]