highheart opened this issue 3 months ago
https://docs.vllm.ai/en/latest/models/performance.html Decrease max_num_seqs or max_num_batched_tokens. This can reduce the number of concurrent requests in a batch, thereby requiring less KV cache space.
https://docs.vllm.ai/en/latest/models/engine_args.html --max-model-len Model context length. If unspecified, will be automatically derived from the model config.
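For concreteness, here is a minimal sketch (my own example, not taken from the docs) of passing these limits to the offline engine; the same values map to the --max-model-len, --max-num-seqs, and --max-num-batched-tokens server flags:

```python
from vllm import LLM

# Illustrative values and a placeholder model path; tighten these to shrink the KV cache footprint.
llm = LLM(
    model="/model",
    max_model_len=8192,           # maximum tokens per sequence (prompt + generated output)
    max_num_seqs=64,              # maximum number of sequences processed concurrently
    max_num_batched_tokens=8192,  # upper bound on tokens handled per scheduler step
    gpu_memory_utilization=0.9,   # fraction of GPU memory vLLM may use
)
```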
Thank you for your reply. If max-model-len is set to 8192, are the most recent 8192 characters always used as the context? That is, no matter how long the user input is, does vLLM always truncate it and keep only the trailing max-model-len portion?
From what I understand, yes. When you exceed max-model-len it will output an error. Truncating would be needed in a chat use case.
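A minimal sketch of that kind of client-side truncation, assuming a Hugging Face tokenizer that matches the served model (the model path and token budget are placeholders):

```python
from transformers import AutoTokenizer

# Placeholder path; use the tokenizer of the model actually being served.
tokenizer = AutoTokenizer.from_pretrained("/model")

def truncate_prompt(prompt: str, max_model_len: int = 8192, max_new_tokens: int = 512) -> str:
    """Keep only the most recent tokens so prompt + generation fits within max_model_len."""
    budget = max_model_len - max_new_tokens
    token_ids = tokenizer.encode(prompt, add_special_tokens=False)
    if len(token_ids) <= budget:
        return prompt
    return tokenizer.decode(token_ids[-budget:])
```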
If you are memory bound rather than model bound for max-model-len, there are several ways to lower memory usage (a combined sketch follows the list):
- --enforce-eager (disables CUDA graphs) — lowers tokens/s.
- --kv-cache-dtype fp8 (can increase KV cache capacity, but it lowers speed significantly in my case) — lowers tokens/s and can affect output.
- --max-num-seqs (how many requests it can run in parallel) — lowering it can give you an edge in memory savings.
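A rough sketch of combining those options through the offline LLM API (illustrative values, my own example, not a recommendation):

```python
from vllm import LLM

# Trade some throughput for lower GPU memory use; placeholder model path.
llm = LLM(
    model="/model",
    max_model_len=8192,
    enforce_eager=True,    # skip CUDA graph capture, saving the memory it would use
    kv_cache_dtype="fp8",  # smaller KV cache entries, so more tokens fit in the same space
    max_num_seqs=32,       # fewer concurrent requests, so less KV cache space is needed
)
```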
What I have observed is that user input longer than max-model-len does not report an error. Can you provide the error message for this case?
From my observations, the library does report an error (version 0.6.1.post2).
Example:
Input prompt (5138 tokens) is too long and exceeds limit of 4096
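For the online server, a hedged sketch of how I would expect such a rejection to surface through the OpenAI-compatible API (the port, model name, and the assumption that an over-length prompt comes back as HTTP 400 are mine, not from this thread):

```python
from openai import BadRequestError, OpenAI

# Assumes a vLLM OpenAI-compatible server running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

try:
    completion = client.completions.create(
        model="/model",                                   # placeholder served model name
        prompt="<a prompt longer than max-model-len>",    # placeholder over-length input
        max_tokens=64,
    )
    print(completion.choices[0].text)
except BadRequestError as err:
    # Assumption: the server rejects over-length prompts with a 400 and a message like the one above.
    print(f"Request rejected: {err}")
```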
Your current environment
How would you like to use vllm
Can someone help me understand what the max_num_seqs and max_model_len parameters do? At what stage do these two parameters take effect? I set the following engine parameters:

{
  "model": "/model",
  "tensor_parallel_size": 8,
  "tokenizer_mode": "auto",
  "trust_remote_code": true,
  "dtype": "auto",
  "gpu_memory_utilization": 0.95,
  "max_num_seqs": 256,
  "max_model_len": 8192,
  "enforce_eager": true
}

Yet the model can still handle an input length of around 16291, calculated using len(prompt).
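One detail that may explain this (my reading, not confirmed in this thread): max_model_len is measured in tokens, while len(prompt) on a Python string counts characters, so a 16291-character prompt can still fall well under 8192 tokens. A sketch for checking the actual token count, assuming the model's Hugging Face tokenizer:

```python
from transformers import AutoTokenizer

# Placeholder path; use the same model directory passed to vLLM.
tokenizer = AutoTokenizer.from_pretrained("/model", trust_remote_code=True)

prompt = "..."  # the actual input text
num_chars = len(prompt)                     # what len(prompt) measures
num_tokens = len(tokenizer.encode(prompt))  # what max_model_len actually limits
print(f"{num_chars} characters -> {num_tokens} tokens (limit: max_model_len=8192)")
```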