mlc-ai / mlc-llm

Universal LLM Deployment Engine with ML Compilation
https://llm.mlc.ai/
Apache License 2.0

[Bug] KVCache takes up too much memory when running mlc_llm.serve.server #2026

Closed · gxmlfx closed this 7 months ago

gxmlfx commented 7 months ago

🐛 Bug

The KVCache takes up too much memory when running mlc_llm.serve.server, but memory usage is normal when using the CLI or mlc_llm.gradio.

To Reproduce

Steps to reproduce the behavior:

Expected behavior

This is what the KVCache takes when using Gradio:

INFO model_metadata.py:96: Total memory usage: 4077.14 MB (Parameters: 3615.13 MB. KVCache: 0.00 MB. Temporary buffer: 462.01 MB)

Environment

Additional context

MasterJH5574 commented 7 months ago

Hi @gxmlfx, this is intentional because the automatic estimation wants to maximize the KV cache capacity for serving. To limit the GPU memory taken by KV cache, you can manually specify --max-total-seq-length with a smaller value. Or you can set the environment variable MLC_GPU_SIZE_BYTES to the number of bytes you want the server to use in total (including parameters, KV cache, temporary buffers, etc.).

https://github.com/mlc-ai/mlc-llm/blob/f04cd3e9e81bcd3c02015df6fe0f0eaa9ffd8453/python/mlc_llm/serve/engine.py#L208-L215

Thank you for the feedback! We will provide detailed documentation regarding this.
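For concreteness, here is a minimal sketch of both options. The two knobs themselves (`--max-total-seq-length` and `MLC_GPU_SIZE_BYTES`) are the ones named above; the model path and the launch-via-subprocess pattern are illustrative assumptions, not the only way to start the server.

```python
# A hedged sketch of the two options above. The model path and the
# subprocess-based launch are illustrative assumptions; the two knobs
# (--max-total-seq-length and MLC_GPU_SIZE_BYTES) come from the comment above.
import os
import subprocess

MODEL = "./dist/Llama-2-7b-chat-hf-q4f16_1-MLC"  # hypothetical model path

# Option 1: shrink the KV cache by capping the total sequence length.
subprocess.run(["python", "-m", "mlc_llm.serve.server",
                "--model", MODEL,
                "--max-total-seq-length", "4096"])

# Option 2: cap the server's total GPU budget (parameters + KV cache +
# temporary buffers) at ~4 GiB via the environment variable.
env = dict(os.environ, MLC_GPU_SIZE_BYTES=str(4 * 1024**3))
subprocess.run(["python", "-m", "mlc_llm.serve.server", "--model", MODEL],
               env=env)
```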

gxmlfx commented 7 months ago


@MasterJH5574 Thanks! That works well. I changed `int(gpu_size_bytes) * 0.90` to `int(gpu_size_bytes) * 0.50` in https://github.com/mlc-ai/mlc-llm/blob/f04cd3e9e81bcd3c02015df6fe0f0eaa9ffd8453/python/mlc_llm/serve/engine.py#L217-L225. Appreciate the upcoming documentation.
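For readers following along, a rough sketch of what that fraction controls (the function name and signature here are illustrative, not the actual engine.py code; see the linked lines for the real logic): the server takes a total GPU byte budget, reserves a fraction of it as usable, and whatever remains after parameters and temporary buffers becomes the KV cache capacity. Lowering the fraction from 0.90 to 0.50 therefore shrinks the KV cache.

```python
# A minimal sketch (not the actual engine.py code) of the budget arithmetic
# the change above tunes: lowering usable_fraction from 0.90 to 0.50 leaves
# less room for the KV cache after parameters and temporary buffers.
import os

def kv_cache_budget(gpu_size_bytes: int, params_bytes: int, temp_bytes: int,
                    usable_fraction: float = 0.90) -> int:
    """Bytes left over for the KV cache under a total-memory budget."""
    # MLC_GPU_SIZE_BYTES, if set, overrides the detected GPU size.
    gpu_size_bytes = int(os.environ.get("MLC_GPU_SIZE_BYTES", gpu_size_bytes))
    usable = int(gpu_size_bytes * usable_fraction)
    return max(usable - params_bytes - temp_bytes, 0)

# Using the numbers from the Gradio log above on a hypothetical 8 GiB GPU:
MB = 1024 * 1024
budget = kv_cache_budget(8 * 1024 * MB, int(3615.13 * MB), int(462.01 * MB))
print(f"KV cache budget: {budget / MB:.0f} MB")
```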