runpod-workers / worker-vllm

The RunPod worker template for serving our large language model endpoints. Powered by vLLM.

feat: add `max_model_length` setup key #35

Closed · willsamu closed this pull request 7 months ago

willsamu commented 7 months ago

Fixes the following error received when trying to run a Mistral-7B finetune:

ValueError: The model's max seq len (32768) is larger than the maximum number of tokens that can be stored in KV cache (29568). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.

gpu_memory_utilization is already set to 0.98 by default, so increasing it further isn't an option. With this PR, MAX_MODEL_LEN can be set as an environment variable. It expects an integer and defaults to 4096.
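For context, here is a minimal sketch of how such a key could be wired into the engine arguments. The MAX_MODEL_LEN name and the 4096 default come from this PR; the other environment variable names and the model ID are illustrative assumptions, not the worker's actual code:

```python
import os

from vllm import AsyncEngineArgs, AsyncLLMEngine

# Read the new setup key from the environment; fall back to 4096 as in this PR.
max_model_len = int(os.environ.get("MAX_MODEL_LEN", 4096))

engine_args = AsyncEngineArgs(
    # MODEL_NAME and GPU_MEMORY_UTILIZATION are assumed names for illustration.
    model=os.environ.get("MODEL_NAME", "mistralai/Mistral-7B-v0.1"),
    gpu_memory_utilization=float(os.environ.get("GPU_MEMORY_UTILIZATION", 0.98)),
    # Capping max_model_len keeps the required KV cache within available VRAM.
    max_model_len=max_model_len,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```

With something like this in place, a worker can be deployed with e.g. `MAX_MODEL_LEN=16384` to trade maximum context length against KV cache headroom instead of failing at engine startup.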

bartlettD commented 7 months ago

This is pretty useful to have; some models like Capybara-34B have ludicrous context lengths that are difficult to fit in a worker's VRAM.

Using this stopped me from hitting out-of-memory errors in those cases.

alpayariyak commented 7 months ago

Thank you for your feedback and for taking the time to make this. This feature has now been added in the latest update!