Open Juelianqvq opened 8 months ago
Can you share
- the script to build the engine
- the `config.pbtxt` of your backend settings
I enabled the options with `--dtype float16 --remove_input_padding --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_inflight_batching --world_size 2 --tp_size 2 --max_output_len 1024`, and here are the pbtxts: pbtxt.zip
I don't see you set the beam_width when you build the engine. Can you try adding `--max_beam_width 4`?
I've added the option and the problem still exists:

[TensorRT-LLM][ERROR] Encountered error for requestId 1380228878: Cannot process new request: Streaming mode is only supported with beam width of 1.
[TensorRT-LLM][ERROR] Cannot process new request: Streaming mode is only supported with beam width of 1.
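The error matches a hard constraint in the in-flight batch manager: a request may stream responses only when its beam width is 1. A minimal sketch of that constraint (a hypothetical `validate_request` helper for illustration, not TensorRT-LLM's actual code):

```python
def validate_request(streaming: bool, beam_width: int) -> None:
    """Mirror the batch-manager check behind the error message above."""
    if streaming and beam_width != 1:
        raise ValueError(
            "Cannot process new request: Streaming mode is only "
            "supported with beam width of 1."
        )

# Beam search (beam_width > 1) works only with streaming disabled;
# streaming requests must use beam_width == 1.
validate_request(streaming=False, beam_width=4)  # accepted
validate_request(streaming=True, beam_width=1)   # accepted
```

So with the current implementation, `--max_beam_width 4` only helps non-streaming requests; streaming clients still have to send `beam_width = 1`.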
Thanks. I found the limitation in the batch manager that I had missed. Could you modify this issue, or open another issue, to request this feature?
OK
When I use perf_analyzer to test the performance, I hit the error "Thread [0] had error: Cannot send stop request without specifying a request_id". Do you know how to fix it?
Try using perf_analyzer as follows when deploying LLaMA2-13B with Triton:

python scripts/launch_triton_server.py --world_size 2 --model_repo triton_model_repo
perf_analyzer -m ensemble -i grpc --shape "bad_words:1" --shape "max_tokens:1" --shape "stop_words:1" --shape "text_input:1" --streaming
However, I encountered an error which implies the beam_width is not set correctly. Also, I'm a beginner and curious about the dimensions given with --shape. Can you give me some suggestions?
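On the --shape question: perf_analyzer needs a concrete shape for every input whose dimensions are variable in the model config, and each flag uses the form `name:d1,d2,...`, so `"text_input:1"` means a one-element tensor holding the prompt string. A minimal sketch of how those specs decompose (a hypothetical `parse_shape_specs` helper for illustration, not part of perf_analyzer):

```python
def parse_shape_specs(specs):
    """Parse perf_analyzer-style --shape specs of the form "name:d1,d2,...".

    Returns a dict mapping each input name to its list of dimensions.
    """
    shapes = {}
    for spec in specs:
        name, _, dims = spec.partition(":")
        shapes[name] = [int(d) for d in dims.split(",")]
    return shapes

# The four flags from the command above each describe a one-element input:
print(parse_shape_specs(["text_input:1", "max_tokens:1",
                         "bad_words:1", "stop_words:1"]))
# → {'text_input': [1], 'max_tokens': [1], 'bad_words': [1], 'stop_words': [1]}
```

A multi-dimensional input would be written like `--shape "input_ids:1,128"`, i.e. dimensions separated by commas after the colon.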