triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

v0.9.0 tensorrt_llm_bls model return error: Model '${tensorrt_llm_model_name}' is not ready. #469

Closed. plt12138 closed this issue 1 month ago.

plt12138 commented 1 month ago

### System Info

TensorRT-LLM: v0.9.0, tensorrtllm_backend: v0.9.0

### Who can help?

@kaiyux

### Information

### Tasks

### Reproduction

1. Set the backend config:
    
    python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,preprocessing_instance_count:1,tokenizer_type:auto

python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,postprocessing_instance_count:1,tokenizer_type:auto

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False

python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:8

python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,max_beam_width:1,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/,batching_strategy:inflight_batching,kv_cache_free_gpu_mem_fraction:0.5,max_queue_delay_microseconds:0,exclude_input_in_output:True,max_attention_window_size:12288

2. Launch the Triton server; all models come up READY:

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --grpc_port 9001 --http_port 9000 --metrics_port 9002 --world_size=1 --model_repo=triton_model_repo

+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+
I0521 12:47:40.294420 6079 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:9001
I0521 12:47:40.294862 6079 http_server.cc:4692] Started HTTPService at 0.0.0.0:9000
I0521 12:47:40.338519 6079 http_server.cc:362] Started Metrics Service at 0.0.0.0:9002


3. Query the server with the Triton generate endpoint:

curl -X POST localhost:8000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'



### Expected behavior

The BLS model returns output.

### Actual behavior

{"error":"Traceback (most recent call last):\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/model.py\", line 94, in execute\n    for res in res_gen:\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/decode.py\", line 194, in decode\n    for gen_response in self._generate(preproc_response, request):\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py\", line 270, in _generate\n    for r in self._exec_triton_request(triton_req):\n  File \"/tensorrtllm_backend/triton_model_repo/tensorrt_llm_bls/1/lib/triton_decoder.py\", line 130, in _exec_triton_request\n    raise pb_utils.TritonModelException(r.error().message())\nc_python_backend_utils.TritonModelException: Model ${tensorrt_llm_model_name} - Error when running inference: Failed for execute the inference request. Model '${tensorrt_llm_model_name}' is not ready.\n"}

### Additional notes

Must the tensorrt_llm_model_name and tensorrt_llm_draft_model_name parameters be set? How are these two parameters used?

janpetrov commented 1 month ago

Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch tensorrt_llm_draft_model_name unless you are interested in speculative decoding.
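For reference, a minimal sketch of supplying that value through fill_template.py, reusing the BLS command from the reproduction steps above; the added key:value pair is the only change, and the exact keys accepted depend on your config.pbtxt template:

```bash
# Sketch: also fill tensorrt_llm_model_name so the ${tensorrt_llm_model_name}
# placeholder in the BLS config is replaced with the actual model name.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm
```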

avianion commented 1 month ago

@janpetrov I am interested in speculative decoding. What do I set the draft model name to?
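(This question is not answered in the thread. A hedged sketch: the draft model name would point at a second Triton model that serves the draft engine. The name `tensorrt_llm_draft` below is illustrative, not taken from this issue.)

```bash
# Illustrative only: assumes a separate model named tensorrt_llm_draft exists
# in the repository and serves the draft engine for speculative decoding.
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False,tensorrt_llm_model_name:tensorrt_llm,tensorrt_llm_draft_model_name:tensorrt_llm_draft
```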

plt12138 commented 1 month ago

> Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch tensorrt_llm_draft_model_name unless you are interested in speculative decoding.

Yes, the issue is resolved. Thanks.