triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

v0.9.0 tensorrt_llm_bls model return error: Model '${tensorrt_llm_model_name}' is not ready. #469

Closed. plt12138 closed this issue 1 month ago.

plt12138 commented 1 month ago

### System Info

TensorRT-LLM: v0.9.0, tensorrtllm_backend: v0.9.0

### Who can help?

@kaiyux

### Information

### Tasks

### Reproduction

1. Set the backend config:
    
    python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,preprocessing_instance_count:1,tokenizer_type:auto

python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,postprocessing_instance_count:1,tokenizer_type:auto

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False

python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:8

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,max_beam_width:1,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/,batching_strategy:inflight_batching,kv_cache_free_gpu_mem_fraction:0.5,max_queue_delay_microseconds:0,exclude_input_in_output:True,max_attention_window_size:12288

2. Launch the Triton server; all models come up READY:

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --grpc_port 9001 --http_port 9000 --metrics_port 9002 --world_size=1 --model_repo=triton_model_repo

+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+
I0521 12:47:40.294420 6079 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:9001
I0521 12:47:40.294862 6079 http_server.cc:4692] Started HTTPService at 0.0.0.0:9000
I0521 12:47:40.338519 6079 http_server.cc:362] Started Metrics Service at 0.0.0.0:9002


3. Query the server with the Triton generate endpoint:

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'



### Expected behavior

The BLS model returns output.

### Actual behavior

{"error":"Traceback (most recent call last):\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py\", line 94, in execute\n    for res in res_gen:\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/decode.py\", line 194, in decode\n    for gen_response in self._generate(preproc_response, request):\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py\", line 270, in _generate\n    for r in self._exec_triton_request(triton_req):\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py\", line 130, in _exec_triton_request\n    raise pb_utils.TritonModelException(r.error().message())\nc_python_backend_utils.TritonModelException: Model ${tensorrt_llm_model_name} - Error when running inference: Failed for execute the inference request. Model '${tensorrt_llm_model_name}' is not ready.\n"}

### Additional notes

Must the tensorrt_llm_model_name and tensorrt_llm_draft_model_name parameters be set? How are these two parameters used?

janpetrov commented 1 month ago

Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch tensorrt_llm_draft_model_name unless you are interested in speculative decoding.
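For reference, a minimal sketch of supplying that value through fill_template.py, reusing the BLS command from the reproduction steps above; the added key:value pair is the only change, and the exact keys accepted depend on your config.pbtxt template:

```bash
# Sketch: also fill tensorrt_llm_model_name so the ${tensorrt_llm_model_name}
# placeholder in the BLS config is replaced with the actual model name.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm
```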

avianion commented 1 month ago

@janpetrov I am interested in speculative decoding. What do I set the draft model name to?
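(This question is not answered in the thread. A hedged sketch: the draft model name would point at a second Triton model that serves the draft engine. The name `tensorrt_llm_draft` below is illustrative, not taken from this issue.)

```bash
# Illustrative only: assumes a separate model named tensorrt_llm_draft exists
# in the repository and serves the draft engine for speculative decoding.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm,tensorrt_llm_draft_model_name:tensorrt_llm_draft
```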

plt12138 commented 1 month ago

> Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch tensorrt_llm_draft_model_name unless you are interested in speculative decoding.

Yes, the issue is resolved. Thanks.