Closed: plt12138 closed this issue 1 month ago
Please set "tensorrt_llm_model_name" to "tensorrt_llm". You do not need to touch "tensorrt_llm_draft_model_name" unless you are interested in speculative decoding.
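For reference, this corresponds to a parameter block like the following in triton_model_repo/tensorrt_llm_bls/config.pbtxt (a sketch of the expected entry; the exact surrounding fields depend on your template version):

```
parameters {
  key: "tensorrt_llm_model_name"
  value {
    string_value: "tensorrt_llm"
  }
}
```

The value must match the directory name of the TensorRT-LLM model inside the Triton model repository, which is "tensorrt_llm" in the default layout used below.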
@janpetrov I am interested in speculative decoding. What do I set the draft model name to?
Yes, the issue is resolved. Thanks.
System Info
TensorRT-LLM: v0.9.0, tensorrtllm_backend: v0.9.0
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
python3 tools/fill_template.py -i triton_model_repo/postprocessing/config.pbtxt tokenizer_dir:/tensorrtllm_backend/tokenizer/,triton_max_batch_size:8,postprocessing_instance_count:1,tokenizer_type:auto
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,bls_instance_count:8,accumulate_tokens:False
python3 tools/fill_template.py -i triton_model_repo/ensemble/config.pbtxt triton_max_batch_size:8
python3 tools/fill_template.py -i triton_model_repo/tensorrt_llm/config.pbtxt triton_max_batch_size:8,decoupled_mode:True,max_beam_width:1,engine_dir:/tensorrtllm_backend/triton_model_repo/tensorrt_llm/1/,batching_strategy:inflight_batching,kv_cache_free_gpu_mem_fraction:0.5,max_queue_delay_microseconds:0,exclude_input_in_output:True,max_attention_window_size:12288
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --grpc_port 9001 --http_port 9000 --metrics_port 9002 --world_size=1 --model_repo=triton_model_repo
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+
I0521 12:47:40.294420 6079 grpc_server.cc:2463] Started GRPCInferenceService at 0.0.0.0:9001
I0521 12:47:40.294862 6079 http_server.cc:4692] Started HTTPService at 0.0.0.0:9000
I0521 12:47:40.338519 6079 http_server.cc:362] Started Metrics Service at 0.0.0.0:9002
curl -X POST localhost:9000/v2/models/tensorrt_llm_bls/generate -d '{"text_input": "What is machine learning?", "max_tokens": 20, "bad_words": "", "stop_words": ""}'
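The same request can be sent from Python. Below is a minimal sketch using only the standard library; the port (9000, matching the --http_port passed to launch_triton_server.py), the model name, and the request fields mirror the commands above, and the "text_output" response field is an assumption based on the bundled tensorrt_llm_bls model config:

```python
import json
import urllib.request


def build_generate_request(text_input, max_tokens=20):
    # Field names follow the curl example above.
    return {
        "text_input": text_input,
        "max_tokens": max_tokens,
        "bad_words": "",
        "stop_words": "",
    }


def generate(text_input, host="localhost", port=9000, model="tensorrt_llm_bls"):
    # The port must match --http_port from launch_triton_server.py.
    url = f"http://{host}:{port}/v2/models/{model}/generate"
    req = urllib.request.Request(
        url,
        data=json.dumps(build_generate_request(text_input)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example (requires the Triton server above to be running):
# print(generate("What is machine learning?")["text_output"])
```

Note that the config above sets decoupled_mode:True; if the non-streaming generate endpoint rejects the request, try the streaming variant (/v2/models/tensorrt_llm_bls/generate_stream) instead.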