TheKidThatCodes opened this issue 2 days ago
To clarify, I would prefer to use vLLM instead of llama-cpp-python (or whatever it's called), because vLLM has incredible performance when it works.
You should avoid setting gpu_memory_utilization=1, since some of the GPU memory is reserved even when idle. To reduce the memory cost of vLLM, you can choose a smaller value of max_model_len and/or max_num_seqs.
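For reference, here is a minimal sketch of what that looks like with the Python API. The model id and the specific values are assumptions chosen only to illustrate the parameters, not recommendations:

```python
from vllm import LLM, SamplingParams

# Sketch: leave headroom below 1.0 and shrink the KV cache footprint.
# All values below are illustrative assumptions.
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model id
    gpu_memory_utilization=0.90,  # avoid 1.0; some GPU memory stays reserved
    max_model_len=8192,           # smaller context -> smaller KV cache
    max_num_seqs=16,              # fewer concurrent sequences -> less memory
)

params = SamplingParams(max_tokens=128)
print(llm.generate(["Hello, how are you?"], params)[0].outputs[0].text)
```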
What would you recommend setting it to?
BTW, you can also set --max-model-len to reduce memory usage if you don't need the full context. If you don't specify it, vLLM uses the model's maximum sequence length; in Llama 3.2's case that's 128K, which causes this error:
ERROR 11-30 13:57:07 engine.py:366] The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (90016). Try increasing gpu_memory_utilization or decreasing max_model_len when initializing the engine.
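If you hit that exact error, one way to work around it is to cap the context length at or below the KV cache capacity the error reports. A minimal sketch, assuming the 3B Instruct model id (the CLI flag in the comment is the serve-time equivalent of the constructor argument):

```python
from vllm import LLM

# Sketch for the KV-cache error above: the reported capacity was 90016 tokens,
# so any max_model_len at or below that should let the engine start.
# Assumed CLI equivalent:
#   vllm serve meta-llama/Llama-3.2-3B-Instruct --max-model-len 16384
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",  # assumed model id
    max_model_len=16384,  # well under the 90016-token KV cache reported in the error
)
```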
Your current environment
How would you like to use vllm
I want to run inference of a [llama 3.2 3b](put link here). It keeps saying I don't have enough memory.