Closed · gxmlfx closed this issue 7 months ago
Hi @gxmlfx, this is intentional because the automatic estimation wants to maximize the KV cache capacity for serving. To limit the GPU memory taken by the KV cache, you can manually specify `--max-total-seq-length` with a smaller value. Or you can set the environment variable `MLC_GPU_SIZE_BYTES` to the number of bytes you want the server to use in total (including parameters, KV cache, temporary buffers, etc.).
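For example, a minimal sketch of both options (only `--max-total-seq-length` and `MLC_GPU_SIZE_BYTES` come from this reply; the model path, the 4096 length, the 8 GiB budget, and the other server flags are placeholder assumptions):

```python
import os
import subprocess

# Option 1: cap the total GPU bytes the server may use
# (parameters + KV cache + temporary buffers). 8 GiB is illustrative only.
os.environ["MLC_GPU_SIZE_BYTES"] = str(8 * 1024**3)

# Option 2: shrink the KV cache directly by lowering --max-total-seq-length.
# The model path below is a placeholder for your compiled model.
subprocess.run(
    [
        "python", "-m", "mlc_llm.serve.server",
        "--model", "dist/your-model-MLC",     # placeholder path
        "--max-total-seq-length", "4096",     # smaller value -> smaller KV cache
    ],
    check=True,
)
```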
Thank you for the feedback! We will provide detailed documentation regarding this.
@MasterJH5574 Thanks! That works well. I changed `int(gpu_size_bytes) * 0.90` to `int(gpu_size_bytes) * 0.50` in https://github.com/mlc-ai/mlc-llm/blob/f04cd3e9e81bcd3c02015df6fe0f0eaa9ffd8453/python/mlc_llm/serve/engine.py#L217-L225. Appreciate the upcoming documentation.
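For context, a rough sketch of the kind of memory-budget calculation being adjusted here (variable names and values are illustrative, not the exact code at the link above):

```python
# The serve engine budgets a fraction of the detected GPU memory for
# parameters, KV cache, and temporary buffers. Lowering the fraction
# leaves more headroom on the card.
gpu_size_bytes = 24 * 1024**3        # e.g. a 24 GB GPU; illustrative only
memory_usage_fraction = 0.50         # lowered from the original 0.90
usable_bytes = int(gpu_size_bytes * memory_usage_fraction)
print(f"engine memory budget: {usable_bytes / 1024**3:.1f} GiB")
```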
🐛 Bug
The KV cache takes up too much memory when running mlc_llm.serve.server, but memory usage is normal when using the CLI or mlc_llm.gradio.
To Reproduce
Steps to reproduce the behavior:
I tried reducing prefill-chunk-size here, but it still did not help.
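For reference, one way to lower the chunk size is to edit the prefill_chunk_size field in the compiled model's mlc-chat-config.json (a sketch; the config path and the value 1024 are placeholders, and whether your build reads this field at serve time is an assumption):

```python
import json
from pathlib import Path

# Placeholder path to the compiled model's chat config.
config_path = Path("dist/model/mlc-chat-config.json")

config = json.loads(config_path.read_text())
print("current prefill_chunk_size:", config.get("prefill_chunk_size"))

# Lower the prefill chunk size (illustrative value) and write the config back.
config["prefill_chunk_size"] = 1024
config_path.write_text(json.dumps(config, indent=2))
```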
Expected behavior
This is how much memory the KV cache takes when using Gradio.
Environment
python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"
, applicable if you compile models):Additional context