adogwangwang opened this issue 1 month ago
Your current environment
Python 3.11, vLLM 0.4.1, torch 2.2.1+cu118
🐛 Describe the bug
Here is my log with vLLM: when running inference with gemma-7b, it shows 6 log entries for a single request, which takes about 30 s. Why is it so slow?

Hi @adogwangwang, the request is taking a long time because you are generating many tokens due to `max_length = 4096`. This setting doesn't control the model's context length; it is a request-level parameter that decides how many tokens to generate. You are likely generating ~4000 tokens, which takes a long time for any engine. If you only want to generate a small number of tokens, please set `max_length` to that number.