dongteng opened this issue 3 months ago
Hey @dongteng - can you try using https://gitlab-master.nvidia.com/ftp/tekit_backend/-/blob/main/inflight_batcher_llm/client/end_to_end_grpc_client.py with the same bls backend and see if it works? It will help isolate the problem.
@dongteng - can you try using `--paged_kv_cache enable` for in-flight batching, and setting `batching_strategy:inflight_fused_batching` for the tensorrt_llm model in the Triton server config (config.pbtxt)?
https://github.com/triton-inference-server/tensorrtllm_backend/issues/348#issuecomment-2114744044 As I mentioned at the link above, you need to turn on the in-flight batching strategy to use TRT-LLM + Triton server in streaming mode.
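For reference, in-flight batching is selected in the tensorrt_llm model's config.pbtxt via its gpt_model_type parameter; `batching_strategy` above is the variable name the repo's fill_template.py substitutes into that field. A sketch (verify the key against your version's template):

```
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
```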
System Info
- GPUs: 2x V100
- Container: nvcr.io/nvidia/tritonserver:24.01-trtllm-python-py3
- tensorrt-llm: 0.7.0
Who can help?
No response
Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
I want to deploy Qwen-14B in Triton following https://github.com/NVIDIA/TensorRT-LLM/tree/a8018c14e6a9868b507a0517550b2cc6e41bd86e/examples/qwen

1. Build the engine:
```
python3 build.py --hf_model_dir /root/model_repo \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --output_dir /root/Qwen/14B/trt_engines/fp16/2-gpu \
    --world_size 2 \
    --tp_size 2
```
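Note: the build command above does not enable paged KV cache, which the in-flight batching suggested in the comments requires. A sketch of the rebuild, assuming the qwen build.py of this release exposes the same store_true flag as the other examples of that era:

```
python3 build.py --hf_model_dir /root/model_repo \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --paged_kv_cache \
    --output_dir /root/Qwen/14B/trt_engines/fp16/2-gpu \
    --world_size 2 \
    --tp_size 2
```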
2. Copy the engine and the HF model files into the Triton model repo:

```
cd /root/Qwen/14B/trt_engines/fp16/2-gpu
cp -r ./* /tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/

cd /root/
cp -r model_repo /tensorrtllm_backend/triton_model_repo/tensorrt_llm/
rm /tensorrtllm_backend/triton_model_repo/tensorrt_llm/model_repo/*.safetensors
```
3. Launch Triton:

```
cd /tensorrtllm_backend
python3 scripts/launch_triton_server.py --world_size=2 --model_repo=/tensorrtllm_backend/triton_model_repo
```
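Once the server is up, the standard Triton readiness endpoints are a quick sanity check before sending requests (default HTTP port 8000 assumed; 8001 is the default gRPC port):

```
curl -sf localhost:8000/v2/health/ready && echo "server ready"
curl -sf localhost:8000/v2/models/tensorrt_llm_bls/ready && echo "bls ready"
```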
4. Send a streaming request:

```
curl -X POST 10.110.31.16:8001/v2/models/tensorrt_llm_bls/generate_stream \
    -d '{"text_input": "<|im_start|>system\n you are a writer .<|im_end|>\n<|im_start|>user\nwho are you ?<|im_end|>\n<|im_start|>assistant\n", "max_tokens": 54, "bad_words": "\n", "stop_words": "", "end_id": [151643], "pad_id": [151643], "stream": true}'
```
It gave me the whole output at once, not a stream:

![image](https://github.com/triton-inference-server/tensorrtllm_backend/assets/84452216/3f460925-41cf-4150-b63e-bba55de854ba)
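For comparison, a minimal Python client sketch that consumes the SSE stream from generate_stream and prints each chunk as it arrives. The host/port is taken from the report (Triton's HTTP endpoint defaults to 8000, so the 8001 mapping here is assumed), and the payload is trimmed for brevity:

```python
import json

import requests

URL = "http://10.110.31.16:8001/v2/models/tensorrt_llm_bls/generate_stream"
payload = {
    "text_input": "who are you?",
    "max_tokens": 54,
    "stream": True,
}

# generate_stream replies with Server-Sent Events: one "data: {...}" line
# per streamed chunk when streaming works, or a single event when it doesn't.
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if line.startswith(b"data:"):
            chunk = json.loads(line[len(b"data:"):])
            print(chunk.get("text_output", ""), flush=True)
```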
The model.py for tensorrt_llm_bls is:

```python
import json
import traceback

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    # (rest of the file omitted in the original report)
```
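For context, the part of a decoupled BLS model that makes streaming work is a loop over the child model's responses, forwarding each chunk through the request's response sender. A heavily simplified sketch, not the shipped model.py (which also handles tokenization and postprocessing); the input plumbing here is illustrative only:

```python
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        for request in requests:
            sender = request.get_response_sender()
            # BLS call into the tensorrt_llm model; with decoupled=True,
            # exec() returns an iterator yielding one response per generated
            # chunk instead of a single blocking response.
            trtllm_request = pb_utils.InferenceRequest(
                model_name="tensorrt_llm",
                inputs=request.inputs(),
                requested_output_names=["output_ids"],
            )
            for trtllm_response in trtllm_request.exec(decoupled=True):
                if trtllm_response.has_error():
                    raise pb_utils.TritonModelException(
                        trtllm_response.error().message())
                # Forward each streamed chunk to the client as it arrives.
                sender.send(pb_utils.InferenceResponse(
                    output_tensors=trtllm_response.output_tensors()))
            # Close the stream for this request.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None; all responses go through the sender.
        return None
```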
The config.pbtxt for tensorrt_llm_bls:
name: "tensorrt_llm_bls" backend: "python" max_batch_size: 4model_transaction_policy { decoupled: true }
input [ { name: "text_input" data_type: TYPE_STRING dims: [ -1 ] }, { name: "max_tokens" data_type: TYPE_INT32 dims: [ -1 ] }, { name: "bad_words" data_type: TYPE_STRING dims: [ -1 ] optional: true }, { name: "stop_words" data_type: TYPE_STRING dims: [ -1 ] optional: true }, { name: "end_id" data_type: TYPE_INT32 dims: [ 1 ] optional: true }, { name: "pad_id" data_type: TYPE_INT32 dims: [ 1 ] optional: true }, { name: "top_k" data_type: TYPE_INT32 dims: [ 1 ] optional: true }, { name: "top_p" data_type: TYPE_FP32 dims: [ 1 ] optional: true }, { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] optional: true }, { name: "length_penalty" data_type: TYPE_FP32 dims: [ 1 ] optional: true }, { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] optional: true }, { name: "min_length" data_type: TYPE_INT32 dims: [ 1 ] optional: true }, { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] optional: true }, { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] optional: true }, { name: "return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] optional: true }, { name: "beam_width" data_type: TYPE_INT32 dims: [ 1 ] optional: true }, { name: "stream" data_type: TYPE_BOOL dims: [ 1 ] optional: true }, { name: "prompt_embedding_table" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true }, { name: "prompt_vocab_size" data_type: TYPE_INT32 dims: [ 1 ] optional: true }, { name: "embedding_bias_words" data_type: TYPE_STRING dims: [ -1 ] optional: true }, { name: "embedding_bias_weights" data_type: TYPE_FP32 dims: [ -1 ] optional: true } ] output [ { name: "text_output" data_type: TYPE_STRING dims: [ -1 ] }, { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] }, { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] } ]
parameters: { key: "accumulate_tokens" value: { string_value: "true" } }
instance_group [ { count: 2 kind : KIND_CPU } ] `
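These fields are normally filled with the repo's tools/fill_template.py rather than edited by hand. A sketch with assumed variable names (decoupled_mode, accumulate_tokens, and bls_instance_count match the BLS template in recent versions, but verify against your copy):

```
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt \
    "triton_max_batch_size:4,decoupled_mode:True,accumulate_tokens:true,bls_instance_count:2"
```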
Expected behavior
I expected streaming output from the request above, but did not get it.
Actual behavior

The response came back as one complete output instead of a stream (see the screenshot above).
Additional notes
When I print some values for debugging, I think the trtllm_response is one whole output and is not iterable.
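One quick check, since this symptom matches what happens when the BLS model is not actually running decoupled: without decoupled mode, a BLS exec() returns a single response object rather than an iterator. A sketch for model.py's initialize (pb_utils.using_decoupled_model_transaction_policy is part of the Python backend utils):

```python
import json

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        model_config = json.loads(args["model_config"])
        decoupled = pb_utils.using_decoupled_model_transaction_policy(
            model_config)
        # If this logs False, the server did not honor
        # model_transaction_policy { decoupled: true }, and any exec()
        # call without decoupled=True returns one whole response instead
        # of an iterator of streamed chunks.
        pb_utils.Logger.log_info(f"tensorrt_llm_bls decoupled: {decoupled}")
```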