Closed · sywangyi closed this 1 day ago
👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.
Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
To run CI, PR reviewers can add the ready label to the PR. 🚀
python benchmark_serving.py --backend tgi --model Qwen/Qwen2-7B-Instruct --trust-remote-code --dataset-name random --random-input-len 2048 --random-output-len 2048 --request-rate inf --num-prompts 100 --endpoint /generate_stream --host 0.0.0.0 --port 8080
Take the above command as an example: with --random-input-len 2048, the benchmark picks 2048 random token ids and feeds them into tokenizer.decode to generate the input prompt. The issue is that feeding that prompt back into tokenizer.encode may produce a different set of token ids, which can be longer than the prompt_len set in the command.
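A minimal sketch of the round trip, assuming the Hugging Face tokenizer for Qwen/Qwen2-7B-Instruct (the model from the command above); it only mimics what the benchmark does and is not the exact code in the script:

```python
# Sketch of the decode/re-encode mismatch, assuming the Hugging Face
# tokenizer for Qwen/Qwen2-7B-Instruct (the model from the command above).
import random
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen2-7B-Instruct", trust_remote_code=True
)

input_len = 2048
# Pick random token ids, mimicking what --dataset-name random does.
token_ids = random.sample(range(tokenizer.vocab_size), input_len)

# Decode to a text prompt, then encode that prompt again.
prompt = tokenizer.decode(token_ids)
reencoded = tokenizer.encode(prompt)

# The round trip is not guaranteed to be lossless, so the re-encoded
# length can differ from (and exceed) the requested input length.
print(f"requested: {input_len}, re-encoded: {len(reencoded)}")
```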
@ywang96 please help review
It works:
2024-11-21T07:50:55.886738Z INFO generate_stream{parameters=GenerateParameters { best_of: Some(1), temperature: Some(0.01), repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: Some(0.99), typical_p: None, do_sample: true, max_new_tokens: Some(2048), return_full_text: None, stop: [], truncate: Some(2048), watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None, adapter_id: None } total_time="20.321773711s" validation_time="91.322646ms" queue_time="18.54581341s" inference_time="1.684637916s" time_per_token="93.590995ms" seed="Some(12714082149378288295)"}: text_generation_router::server: router/src/server.rs:621: Success
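For context, a hedged sketch of the kind of /generate_stream request that produces the log above, with truncate bounding the prompt on the server side. Field names mirror the GenerateParameters shown in the log; the exact payload benchmark_serving.py sends may differ, and prompt_text is a placeholder:

```python
# Sketch of a TGI /generate_stream request with server-side truncation;
# not the exact payload from benchmark_serving.py.
import requests

prompt_text = "example prompt decoded from random token ids"  # placeholder

payload = {
    "inputs": prompt_text,
    "parameters": {
        "max_new_tokens": 2048,
        "truncate": 2048,  # cap the prompt at the requested input length
        "do_sample": True,
        "temperature": 0.01,
        "top_p": 0.99,
    },
}
resp = requests.post(
    "http://0.0.0.0:8080/generate_stream", json=payload, stream=True
)
for line in resp.iter_lines():
    if line:
        print(line.decode())
```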
Thanks for this fix! @sywangyi
Just so I understand the problem correctly, is this because the tokenizer from the TGI server encodes the text prompt to a larger number of tokens than prompt_len, so we're asking TGI to truncate it on the server side?
Yes. I assume not only TGI but other backends may have this issue as well. I did an experiment: pick some token ids randomly, decode them to text with the tokenizer, then encode the text back to token ids. The output token ids may not be the same as the original ones.
I see. I guess realistically speaking, unless all backends directly accept token IDs as input, there's no way we can guarantee that the tok & detok matches. This fix looks fair to me.
This mismatch causes errors like "inputs tokens + max_new_tokens must be <= xxx. Given: xxx inputs tokens and xxx max_new_tokens".