System Info
Who can help?
@byshiue @schetlur-nv
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
end_to_end_grpc_client.py
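The report does not include the exact client command, so the invocation below is a sketch: the -u, -p, and -o flags are assumptions about the gRPC client's interface, localhost:8001 is a placeholder Triton endpoint, and the prompt and output length mirror the run.py comparison further down:

python3 end_to_end_grpc_client.py -u localhost:8001 -p 'What is machine learning?' -o 1024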
Expected behavior
Increasing --output-len beyond 711 should produce correspondingly longer output text.
actual behavior
When --output-len is larger than 711, the output text length stops increasing. However, with the script https://github.com/NVIDIA/TensorRT-LLM/blob/v0.9.0/examples/run.py, the output text length keeps growing even when I raise --output-len to 1024 (the output text length is 5065):
python run.py --tokenizer_dir /lpai/volumes/cloudmodel-muses/lt/models/Llama-2-7b-chat-hf/ --engine_dir=/lpai/trt_engines/llama/7B/trt_engines/fp16/1-gpu/ --max_output_len=1024 --input_text='What is machine learning?'
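Since --output-len is a token budget while the reported lengths appear to count characters, comparing the two for a given response can show whether the plateau near 711 is really a fixed token cap. A minimal sketch, assuming transformers is installed and using the same tokenizer directory as the run.py command above:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/lpai/volumes/cloudmodel-muses/lt/models/Llama-2-7b-chat-hf/")

# Paste the text returned by the client; compare its character count
# (what "output text length" measures) with its token count (what
# --output-len limits).
output_text = "..."
num_tokens = len(tokenizer.encode(output_text, add_special_tokens=False))
print(f"characters: {len(output_text)}, tokens: {num_tokens}")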
additional notes
no