vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Benchmark_serving.py fails for HuggingFace TGI #1300

Closed: dyastremsky closed this issue 4 months ago

dyastremsky commented 9 months ago

I am seeing the below error about max_tokens when I run benchmark_serving.py for HuggingFace TGI. Is there anything else I should be doing?

I started the server with: ./launch_tgi_server.sh facebook/opt-125m. I started the benchmarking script with: python3 benchmark_serving.py --backend tgi --tokenizer facebook/opt-125m --dataset ShareGPT_V3_unfiltered_cleaned_split.json.

[Screenshot: TGI validation error about max_new_tokens]

linkedlist771 commented 9 months ago

You need to adjust the Docker start parameters max_total_tokens and max_input_length; try increasing them!

dyastremsky commented 9 months ago

Thank you! Do you happen to know what the best values are? I used the default script.

Wouldn't it simply reject prompts that exceed the token count or length limits rather than throwing an error? Otherwise, the only workaround would be to set the maximum arbitrarily high, which would not give an equivalent benchmark (since we'd be running TGI on much longer prompts).

linkedlist771 commented 9 months ago

I happen to know how to solve your problem because I am running benchmarks across different frameworks. The default settings of TGI are quite limited, so setting the max input length and max total tokens yourself is reasonable. When a new LLM is built, those parameters are designed up front, so if you search for the model you are using, you will find them. For example, for gpt-3.5-turbo, max_total_tokens is 4096 for the default version.

As for your first question, I have no idea what the best setting is; it depends heavily on your model and use case. Anyway, I am using vicuna-13b-v1.5 for my benchmark tests, with max_total_tokens: 4096 and max_input_length: 3072. You can check the meaning and default values of those parameters with (enter the Docker container first if you use Docker):

text-generation-launcher  --help
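
A quick way to see a model's context window (which max_total_tokens should stay within) is to read it from the Hugging Face config. This is a minimal sketch, assuming the transformers library is installed; the field name varies by architecture (OPT exposes max_position_embeddings, which is 2048 for facebook/opt-125m):

from transformers import AutoConfig

# Download the model config from the Hugging Face Hub and print its context window.
# Field name assumed for OPT-style models; other architectures may use
# n_positions or max_sequence_length instead.
cfg = AutoConfig.from_pretrained("facebook/opt-125m")
print(cfg.max_position_embeddings)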
dyastremsky commented 9 months ago

Got it. Thank you for your help!

dyastremsky commented 9 months ago

This error actually looks like it's coming from the parameter passed in via the benchmark_serving.py script ("max_new_tokens"). How would changing the arguments to the Docker container help?

It wouldn't be a fair comparison to let vLLM cap the number of tokens returned but then let TGI return unlimited tokens. It also should not be erroring out on that parameter. How are you benchmarking using this script?
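
For context, TGI validates each request against the limits it was launched with rather than silently truncating. The sketch below is a hypothetical standalone request against TGI's /generate endpoint (not the exact payload benchmark_serving.py builds, and the port is assumed to be 8080); it triggers the same kind of validation error when the prompt tokens plus max_new_tokens exceed the server's max_total_tokens:

import requests

# Hypothetical request; host and port are assumptions, not from the original thread.
payload = {
    "inputs": "Hello, my name is",
    "parameters": {"max_new_tokens": 2000, "do_sample": True},
}
resp = requests.post("http://localhost:8080/generate", json=payload)
# If prompt tokens + max_new_tokens exceed max_total_tokens, TGI rejects the
# request with a validation error instead of generating fewer tokens.
print(resp.status_code, resp.text)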

linkedlist771 commented 9 months ago

Actually, these are launch parameters for the TGI Docker container; you can list them with

text-generation-launcher --help

If you want to raise those limits, for instance to the values I use (max_input_length 3072 and max_total_tokens 4096), pass them when starting the container; note that the launcher flags are spelled with dashes:

model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.1.0 --model-id $model --max-input-length 3072 --max-total-tokens 4096

For more detailed information, see: https://github.com/huggingface/text-generation-inference/blob/main/launcher/src/main.rs
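
After restarting the container, you can confirm which limits are actually in effect. A small sketch, assuming TGI's /info endpoint on port 8080 (field names taken from TGI 1.1.0; check the raw response for your version):

import requests

# /info reports the settings the server was launched with, including
# max_input_length and max_total_tokens (names assumed from TGI 1.1.0).
info = requests.get("http://localhost:8080/info").json()
print(info.get("max_input_length"), info.get("max_total_tokens"))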

hmellor commented 4 months ago

Closing this issue as stale as there has been no discussion in the past 3 months.

If you are still experiencing the issue you describe, feel free to re-open this issue.