vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: MLA Support #4625

Open chengtbf opened 5 months ago

chengtbf commented 5 months ago

🚀 The feature, motivation and pitch

DeepSeek-V2 introduces MLA (Multi-head Latent Attention), which uses low-rank joint compression of keys and values to eliminate the inference-time key-value cache bottleneck, enabling efficient inference.
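For context, the core idea is that only a small latent vector per token is cached instead of the full per-head keys and values, which are reconstructed via up-projections at attention time. Below is a minimal, illustrative PyTorch sketch of that low-rank joint KV compression; the class and parameter names (`MLAKVCompression`, `kv_latent_dim`, etc.) are assumptions for illustration, not vLLM or DeepSeek code, and it omits details such as the decoupled RoPE path.

```python
import torch
import torch.nn as nn

# Illustrative sketch of MLA's low-rank joint KV compression (not vLLM code).
class MLAKVCompression(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int, head_dim: int, kv_latent_dim: int):
        super().__init__()
        # Down-projection: hidden state -> small latent vector that is cached.
        self.w_dkv = nn.Linear(hidden_size, kv_latent_dim, bias=False)
        # Up-projections: latent vector -> per-head keys and values at attention time.
        self.w_uk = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.w_uv = nn.Linear(kv_latent_dim, num_heads * head_dim, bias=False)
        self.num_heads = num_heads
        self.head_dim = head_dim

    def compress(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Only this [batch, seq, kv_latent_dim] tensor goes into the KV cache,
        # instead of full keys/values of shape [batch, seq, 2 * num_heads * head_dim].
        return self.w_dkv(hidden_states)

    def expand(self, latent: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        # Reconstruct per-head keys and values from the cached latent vector.
        b, s, _ = latent.shape
        k = self.w_uk(latent).view(b, s, self.num_heads, self.head_dim)
        v = self.w_uv(latent).view(b, s, self.num_heads, self.head_dim)
        return k, v
```

Because the cache stores `kv_latent_dim` values per token rather than `2 * num_heads * head_dim`, the KV-cache memory footprint shrinks by roughly that ratio, which is what makes larger batch sizes and longer contexts feasible.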


Can vLLM support MLA for accelerated inference?

@misc{deepseek-v2, author = {DeepSeek-AI}, title = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model}, year = {2024}, note = {GitHub repository}, url = {https://github.com/deepseek-ai/deepseek-v2} }

https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf

Alternatives

No response

Additional context

No response

datalee commented 5 months ago

mark

obitoquilt commented 4 months ago

mark

chenrui17 commented 3 months ago

mark

lengyueyang commented 2 months ago

mark

RanchiZhao commented 2 months ago

mark

Jiayi-Pan commented 2 months ago

mark

quwu0820 commented 2 months ago

mark

qianchen94 commented 2 months ago

mark

lumosity4tpj commented 1 month ago

mark

zhyncs commented 1 month ago

ref https://github.com/vllm-project/vllm/pull/4650#issuecomment-2297051077