highheart opened this issue 3 months ago
https://docs.vllm.ai/en/latest/models/performance.html Decrease max_num_seqs or max_num_batched_tokens. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
https://docs.vllm.ai/en/latest/models/engine_args.html --max-model-len Model context length. If unspecified, will be automatically derived from the model config.
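For concreteness, here is a minimal sketch (my own example, not taken from the docs) of passing these limits to the offline engine; the same values map to the --max-model-len, --max-num-seqs, and --max-num-batched-tokens server flags:

```python
from vllm import LLM

# Illustrative values and a placeholder model path; tighten these to shrink the KV cache footprint.
llm = LLM(
    model="/model",
    max_model_len=8192,           # maximum tokens per sequence (prompt + generated output)
    max_num_seqs=64,              # maximum number of sequences processed concurrently
    max_num_batched_tokens=8192,  # upper bound on tokens handled per scheduler step
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may use
)
```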
Thank you for your reply. If max-model-len is set to 8192, are the most recent 8192 characters always used as the context? That is, no matter how long the user input is, does vLLM always truncate it and keep only the trailing max-model-len portion?
From what I understand, yes. When you exceed max-model-len it will output an error. Truncating would be needed in a chat use case.
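A minimal sketch of that kind of client-side truncation, assuming a Hugging Face tokenizer that matches the served model (the model path and token budget are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder path; use the tokenizer of the model actually being served.
tokenizer = AutoTokenizer.from_pretrained("/model")

def truncate_prompt(prompt: str, max_model_len: int = 8192, max_new_tokens: int = 512) -> str:
    """Keep only the most recent tokens so prompt + generation fits within max_model_len."""
    budget = max_model_len - max_new_tokens
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(token_ids) <= budget:
        return prompt
    return tokenizer.decode(token_ids[-budget:])
```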
If you are memory bound rather than model bound for max-model-len, there are several ways to lower memory usage (a combined sketch follows the list):
- --enforce-eager (disables CUDA graphs) — lowers tokens/s.
- --kv-cache-dtype fp8 (can increase KV cache capacity, but it lowers speed significantly in my case) — lowers tokens/s and can affect output.
- --max-num-seqs (how many requests it can run in parallel) — lowering it can give you an edge in memory savings.
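A rough sketch of combining those options through the offline LLM API (illustrative values, my own example, not a recommendation):

```python
from vllm import LLM

# Trade some throughput for lower GPU memory use; placeholder model path.
llm = LLM(
    model="/model",
    max_model_len=8192,
    enforce_eager=True,    # skip CUDA graph capture, saving the memory it would use
    kv_cache_dtype="fp8",  # smaller KV cache entries, so more tokens fit in the same space
    max_num_seqs=32,       # fewer concurrent requests, so less KV cache space is needed
)
```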
What I have observed is that user input longer than max-model-len does not report an error. Can you provide the error message for this case?
From my observations, the library does report an error (version 0.6.1.post2).
Example:
Input prompt (5138 tokens) is too long and exceeds limit of 4096
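For the online server, a hedged sketch of how I would expect such a rejection to surface through the OpenAI-compatible API (the port, model name, and the assumption that an over-length prompt comes back as HTTP 400 are mine, not from this thread):

```python
from openai import BadRequestError, OpenAI

# Assumes a vLLM OpenAI-compatible server running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

try:
    completion = client.completions.create(
        model="/model",                                   # placeholder served model name
        prompt="<a prompt longer than max-model-len>",    # placeholder over-length input
        max_tokens=64,
    )
    print(completion.choices[0].text)
except BadRequestError as err:
    # Assumption: the server rejects over-length prompts with a 400 and a message like the one above.
    print(f"Request rejected: {err}")
```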
Your current environment
How would you like to use vllm
Can someone help me understand what the max_num_seqs and max_model_len parameters do? At what stage do these two parameters take effect? I set the following engine parameters:

{
  "model": "/model",
  "tensor_parallel_size": 8,
  "tokenizer_mode": "auto",
  "trust_remote_code": true,
  "dtype": "auto",
  "gpu_memory_utilization": 0.95,
  "max_num_seqs": 256,
  "max_model_len": 8192,
  "enforce_eager": true
}

Yet the model can still handle an input length of around 16291, calculated using len(prompt).
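One detail that may explain this (my reading, not confirmed in this thread): max_model_len is measured in tokens, while len(prompt) on a Python string counts characters, so a 16291-character prompt can still fall well under 8192 tokens. A sketch for checking the actual token count, assuming the model's Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

# Placeholder path; use the same model directory passed to vLLM.
tokenizer = AutoTokenizer.from_pretrained("/model", trust_remote_code=True)

prompt = "..."  # the actual input text
num_chars = len(prompt)                     # what len(prompt) measures
num_tokens = len(tokenizer.encode(prompt))  # what max_model_len actually limits
print(f"{num_chars} characters -> {num_tokens} tokens (limit: max_model_len=8192)")
```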