🚀 The feature, motivation and pitch
DeepSeek-V2 introduces MLA (Multi-head Latent Attention), which uses low-rank joint key-value compression to eliminate the inference-time key-value cache bottleneck, enabling efficient inference.
Could vLLM support MLA for accelerated inference?
@misc{deepseek-v2,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
  year   = {2024},
  note   = {GitHub repository},
  url    = {https://github.com/deepseek-ai/deepseek-v2}
}
https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat
https://github.com/deepseek-ai/DeepSeek-V2/blob/main/deepseek-v2-tech-report.pdf
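To illustrate why MLA shrinks the KV cache, here is a minimal NumPy sketch of low-rank joint key-value compression. All dimensions and weight names (`W_dkv`, `W_uk`, `W_uv`) are made up for the example and are much smaller than DeepSeek-V2's actual configuration; this is a sketch of the idea, not the model's implementation:

```python
import numpy as np

# Hypothetical dimensions for illustration (not DeepSeek-V2's real sizes).
d_model = 64    # hidden size
d_latent = 8    # shared KV latent dimension (d_latent << d_model)
n_heads = 4
d_head = 16
seq_len = 10

rng = np.random.default_rng(0)

# Down-projection to a shared KV latent, plus up-projections back to
# per-head keys and values (names are illustrative).
W_dkv = rng.standard_normal((d_model, d_latent)) * 0.02
W_uk = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02
W_uv = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02

h = rng.standard_normal((seq_len, d_model))  # token hidden states

# During inference, only the compressed latent needs to be cached...
c_kv = h @ W_dkv                # (seq_len, d_latent)

# ...and full keys/values are reconstructed from it on demand.
k = c_kv @ W_uk                 # (seq_len, n_heads * d_head)
v = c_kv @ W_uv                 # (seq_len, n_heads * d_head)

# Cache footprint: standard MHA stores keys AND values per head,
# MLA stores only the latent.
full_cache = 2 * seq_len * n_heads * d_head   # 1280 floats
latent_cache = seq_len * d_latent             # 80 floats
print(full_cache, latent_cache)
```

In this toy setting the cache shrinks by 16x (1280 vs. 80 floats per layer); the real savings depend on the model's chosen latent dimension.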
Alternatives
No response
Additional context
No response