vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

fix the issue that len(tokenizer(prompt)["input_ids"]) > prompt_len #10524

Closed: sywangyi closed this 1 day ago

sywangyi commented 1 day ago

This causes errors like "inputs tokens + max_new_tokens must be <= xxx. Given: xxx inputs tokens and xxx max_new_tokens".

github-actions[bot] commented 1 day ago

👋 Hi! Thank you for contributing to the vLLM project. Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

🚀

sywangyi commented 1 day ago

python benchmark_serving.py --backend tgi --model Qwen/Qwen2-7B-Instruct --trust-remote-code --dataset-name random --random-input-len 2048 --random-output-len 2048 --request-rate inf --num-prompts 100 --endpoint /generate_stream --host 0.0.0.0 --port 8080

Take the above command as an example. With --random-input-len 2048, the benchmark picks 2048 random token IDs and feeds them to tokenizer.decode to generate the input prompt. The issue is that feeding that prompt back through tokenizer.encode may produce a different set of token IDs, which can be longer than the prompt_len set in the command.
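For illustration, here is a minimal sketch of that round trip. The model name and input length follow the command above; the script itself is only an assumed standalone reproduction, not code from benchmark_serving.py:

```python
import random

from transformers import AutoTokenizer

# Assumed standalone reproduction of the decode/encode round trip.
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", trust_remote_code=True)

prompt_len = 2048
# Mimic the random dataset: sample arbitrary token IDs from the vocabulary.
random_ids = [random.randint(0, tokenizer.vocab_size - 1)
              for _ in range(prompt_len)]

# Decode to text, then re-encode that text.
prompt = tokenizer.decode(random_ids)
reencoded = tokenizer(prompt, add_special_tokens=False)["input_ids"]

# The re-encoded sequence often differs from the original IDs and can be
# longer than prompt_len, which is what trips the server-side length check.
print(len(random_ids), len(reencoded), reencoded == random_ids)
```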

sywangyi commented 1 day ago

[screenshot]

sywangyi commented 1 day ago

@ywang96 please help review

MingxuZh commented 1 day ago

It works: [screenshot]

2024-11-21T07:50:55.886738Z INFO generate_stream{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.01), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.99), typical_p: None, do_sample: true, max_new_tokens: Some(2048), return_full_text: None, stop: [], truncate: Some(2048), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None } total_time="20.321773711s" validation_time="91.322646ms" queue_time="18.54581341s" inference_time="1.684637916s" time_per_token="93.590995ms" seed="Some(12714082149378288295)"}: text_generation_router::server: router/src/server.rs:621: Success

sywangyi commented 1 day ago

> Thanks for this fix! @sywangyi
>
> Just so I understand the problem correctly, is this because the tokenizer from the TGI server encodes the text prompt to more tokens than prompt_len, so we're asking TGI to truncate it on the server side?

Yes. I assume not only TGI but other backends may have this issue as well. I did an experiment: pick some token IDs randomly, decode them to text with the tokenizer, then encode the text back into token IDs. The output token IDs may not be the same as the original ones.

ywang96 commented 1 day ago

> Thanks for this fix! @sywangyi Just so I understand the problem correctly, is this because the tokenizer from the TGI server encodes the text prompt to more tokens than prompt_len, so we're asking TGI to truncate it on the server side?
>
> Yes. I assume not only TGI but other backends may have this issue as well. I did an experiment: pick some token IDs randomly, decode them to text with the tokenizer, then encode the text back into token IDs. The output token IDs may not be the same as the original ones.

I see. Realistically speaking, unless all backends directly accept token IDs as input, there's no way to guarantee that tokenization and detokenization round-trip exactly. This fix looks fair to me.
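For reference, a hedged sketch of the request-side change being discussed, not necessarily the exact code in this PR: the TGI /generate_stream payload can carry a truncate parameter (visible in the server log above as truncate: Some(2048)), so the server clips the prompt to the advertised input length even when the decode/re-encode round trip produces extra tokens. The field names follow TGI's GenerateParameters; the helper name is hypothetical.

```python
import json
import urllib.request


def build_tgi_request(api_url: str, prompt: str, input_len: int,
                      output_len: int) -> urllib.request.Request:
    """Hypothetical helper: build a TGI /generate_stream request that asks the
    server to truncate the prompt to at most input_len tokens."""
    payload = {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": output_len,
            # Server-side truncation: even if the decoded prompt re-encodes to
            # more than input_len tokens, TGI clips it before the length check.
            "truncate": input_len,
            "do_sample": True,
        },
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Example usage against the benchmark command above (host/port are the same
# placeholders used there).
request = build_tgi_request("http://0.0.0.0:8080/generate_stream",
                            prompt="...", input_len=2048, output_len=2048)
```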