triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Invalid argument: model input cannot have empty reshape for non-batching model as scalar tensors are not supported for tensorrt_llm #372

Closed · mse700 closed this issue 6 months ago

mse700 commented 6 months ago

I have followed the instructions here for deploying a Llama-2 model. I prepared config.pbtxt as pointed out in the repository, but I get the following error.

Error: Invalid argument: model input cannot have empty reshape for non-batching model as scalar tensors are not supported for tensorrt_llm

tensorrt_llm/config.pbtxt:

```
name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 0

dynamic_batching {
  max_queue_delay_microseconds: 3
}

input [
  { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] allow_ragged_batch: true },
  { name: "input_lengths" data_type: TYPE_INT32 dims: [ -1 ] reshape: { shape: [ ] } },
  { name: "request_output_len" data_type: TYPE_INT32 dims: [ 1 ] },
  { name: "draft_input_ids" data_type: TYPE_INT32 dims: [ -1 ] optional: true allow_ragged_batch: true },
  { name: "end_id" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "pad_id" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true },
  { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true },
  { name: "embedding_bias" data_type: TYPE_FP32 dims: [ -1 ] optional: true allow_ragged_batch: true },
  { name: "beam_width" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_k" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "min_length" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "frequency_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_context_logits" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "return_generation_logits" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true },
  { name: "prompt_embedding_table" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  { name: "prompt_vocab_size" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true },
  { name: "lora_weights" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true },
  { name: "lora_config" data_type: TYPE_INT32 dims: [ -1, 3 ] optional: true allow_ragged_batch: true }
]

output [
  { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] },
  { name: "sequence_length" data_type: TYPE_INT32 dims: [ -1 ] },
  { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] },
  { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] },
  { name: "context_logits" data_type: TYPE_FP32 dims: [ -1, -1 ] },
  { name: "generation_logits" data_type: TYPE_FP32 dims: [ -1, -1, -1 ] }
]

instance_group [
  { count: 1 kind: KIND_CPU }
]

parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } }
parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } }
parameters: { key: "gpt_model_path" value: { string_value: "/llama2-70b-trtllm/trt_model_repo/tensorrt_llm/1" } }
parameters: { key: "batch_scheduler_policy" value: { string_value: "max_utilization" } }
```
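If I read the error correctly, the problem is the combination of `max_batch_size: 0` with the empty reshapes: Triton applies `reshape: { shape: [ ] }` to the per-request shape after the batch dimension is stripped, so it only makes sense for a batching model. With `max_batch_size: 0` there is no batch dimension, and an empty reshape would leave a 0-d scalar tensor, which the tensorrt_llm backend rejects (the `dynamic_batching` block presumably also needs a positive `max_batch_size`). A minimal sketch of the two configurations I believe would be consistent; the value 8 below is an arbitrary example, not from my setup:

```
# Option A (sketch): enable batching and keep the empty reshapes.
# Triton then strips the batch dimension and the backend sees one
# scalar per request, which is what the reshape is for.
max_batch_size: 8   # arbitrary example value; anything > 0

input [
  {
    name: "end_id"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }   # legal here: the batch dimension remains
    optional: true
  }
]

# Option B (sketch): keep max_batch_size: 0 but remove every
# "reshape: { shape: [ ] }" so no input collapses to a 0-d scalar.
```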
server log:

```
I0310 14:46:50.544452 8125 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f2078000000' with size 268435456
I0310 14:46:50.554774 8125 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0310 14:46:50.554782 8125 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
[libprotobuf ERROR /tmp/tritonbuild/tritonserver/build/_deps/repo-third-party-build/grpc-repo/src/grpc/third_party/protobuf/src/google/protobuf/text_format.cc:335] Error parsing text-format inference.ModelConfig: 29:17: Expected integer, got: $
E0310 14:46:50.729215 8125 model_repository_manager.cc:1335] Poll failed for model directory 'tensorrt_llm_bls': failed to read text proto from /llama2-70b-trtllm/trt_model_repo/tensorrt_llm_bls/config.pbtxt
I0310 14:46:50.729364 8125 model_lifecycle.cc:469] loading: postprocessing:1
I0310 14:46:50.729438 8125 model_lifecycle.cc:469] loading: preprocessing:1
I0310 14:46:50.729483 8125 model_lifecycle.cc:469] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
E0310 14:46:50.812889 8125 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Invalid argument: model input cannot have empty reshape for non-batching model as scalar tensors are not supported for tensorrt_llm
I0310 14:46:50.812913 8125 model_lifecycle.cc:773] failed to load 'tensorrt_llm'
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0310 14:46:52.404642 8125 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0310 14:46:52.408845 8125 python_be.cc:2381] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
I0310 14:46:52.982074 8125 model_lifecycle.cc:835] successfully loaded 'preprocessing'
I0310 14:46:53.012704 8125 model_lifecycle.cc:835] successfully loaded 'postprocessing'
E0310 14:46:53.012785 8125 model_repository_manager.cc:579] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Invalid argument: model input cannot have empty reshape for non-batching model as scalar tensors are not supported for tensorrt_llm;
I0310 14:46:53.012846 8125 server.cc:607]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0310 14:46:53.012891 8125 server.cc:634]
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                                                |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/back |
|             |                                                                 | ends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}              |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/back |
|             |                                                                 | ends","min-compute-capability":"6.000000","default-max-batch-size":"4"}}              |
+-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------+

I0310 14:46:53.012930 8125 server.cc:677]
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Model          | Version | Status                                                                                                                                     |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1       | READY                                                                                                                                      |
| preprocessing  | 1       | READY                                                                                                                                      |
| tensorrt_llm   | 1       | UNAVAILABLE: Invalid argument: model input cannot have empty reshape for non-batching model as scalar tensors are not supported for tensor |
|                |         | rt_llm                                                                                                                                     |
+----------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+

I0310 14:46:53.013183 8125 metrics.cc:770] Collecting CPU metrics
I0310 14:46:53.013297 8125 tritonserver.cc:2508]
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                               |
| server_version                   | 2.43.0                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memor |
|                                  | y cuda_shared_memory binary_tensor_data parameters statistics trace logging                                                          |
| model_repository_path[0]         | /llama2-70b-trtllm/trt_model_repo                                                                                                    |
| model_control_mode               | MODE_NONE                                                                                                                            |
| strict_model_config              | 0                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                             |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                                                             |
| min_supported_compute_capability | 6.0                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                   |
| cache_enabled                    | 0                                                                                                                                    |
+----------------------------------+--------------------------------------------------------------------------------------------------------------------------------------+

I0310 14:46:53.013321 8125 server.cc:307] Waiting for in-flight requests to complete.
I0310 14:46:53.013327 8125 server.cc:323] Timeout 30: Found 0 model versions that have in-flight inferences
I0310 14:46:53.013693 8125 server.cc:338] All models are stopped, unloading models
I0310 14:46:53.013702 8125 server.cc:347] Timeout 30: Found 2 live models and 0 in-flight non-inference requests
I0310 14:46:54.013775 8125 server.cc:347] Timeout 29: Found 2 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
I0310 14:46:54.339797 8125 model_lifecycle.cc:620] successfully unloaded 'postprocessing' version 1
I0310 14:46:54.352598 8125 model_lifecycle.cc:620] successfully unloaded 'preprocessing' version 1
I0310 14:46:55.013860 8125 server.cc:347] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
```
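Possibly related: before the tensorrt_llm failure, the log also shows a text-format parse error for tensorrt_llm_bls/config.pbtxt (`29:17: Expected integer, got: $`). That looks like an unfilled `${...}` placeholder left over from the repository's template configs, which tools/fill_template.py is meant to substitute. A hypothetical illustration of the kind of line that would trigger it:

```
# Hypothetical: a leftover template placeholder in tensorrt_llm_bls/config.pbtxt.
# text_format expects an integer here, so the literal "$" fails to parse.
max_batch_size: ${triton_max_batch_size}
```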