
[Feature]: Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference #3549

Open tchaton opened 6 months ago

tchaton commented 6 months ago

🚀 The feature, motivation and pitch

This paper might be of interest: https://arxiv.org/pdf/2403.09636.pdf
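
For context, the core mechanism the paper proposes (as I read it) is an append-or-merge KV cache update: at each decoding step the model predicts a decision and an importance weight, and the new key/value pair is either appended as a fresh cache slot or accumulated into the last slot via a weighted average, so the cache grows sub-linearly. A minimal single-head sketch of that update, with variable names and shapes that are my own assumptions rather than the paper's or vLLM's code:

```python
import torch


def dmc_cache_update(keys, values, weights, k_new, v_new, alpha, omega):
    """Toy single-head DMC-style cache update (shapes are assumptions).

    keys, values: [cache_len, head_dim] current KV cache for one head
    weights:      [cache_len] accumulated importance weight per slot
    k_new, v_new: [head_dim] key/value of the incoming token
    alpha:        0/1 decision; 1 = append a new slot, 0 = merge into the last slot
    omega:        scalar tensor, importance weight predicted for the incoming token
    """
    if alpha == 1:
        # Append: behaves like a standard KV cache, one slot per token.
        keys = torch.cat([keys, k_new[None]], dim=0)
        values = torch.cat([values, v_new[None]], dim=0)
        weights = torch.cat([weights, omega[None]], dim=0)
    else:
        # Merge: fold the new token into the last slot with a weighted average,
        # so the cache length (and attention cost) does not grow at this step.
        w_old = weights[-1]
        keys[-1] = (w_old * keys[-1] + omega * k_new) / (w_old + omega)
        values[-1] = (w_old * values[-1] + omega * v_new) / (w_old + omega)
        weights[-1] = w_old + omega
    return keys, values, weights


# Example: merge one new token into a 4-slot cache; length stays at 4.
keys, values, weights = torch.randn(4, 64), torch.randn(4, 64), torch.ones(4)
k_new, v_new = torch.randn(64), torch.randn(64)
keys, values, weights = dmc_cache_update(
    keys, values, weights, k_new, v_new, alpha=0, omega=torch.tensor(1.0)
)
print(keys.shape)  # torch.Size([4, 64]) -- unchanged, token was merged
```

The part that presumably matters for vLLM is that the effective cache length then varies per head and per layer, which would interact with PagedAttention's block management.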

Alternatives

No response

Additional context

No response

condy0919 commented 4 weeks ago

According to the paper,

In our experiments, we equip pre-existing LLMs—such as Llama 2 (Touvron et al., 2023) 7B, 13B, and 70B—with DMC by retrofitting them on a negligible percentage of the original pre-training data (~2% for 2× compression, and ~8% for 8× compression) and without adding any extra parameters to the original LLM.

The amount of retrofitting data required still seems large, IMHO.
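
For scale, a back-of-envelope estimate, assuming Llama 2's roughly 2T-token pre-training corpus reported by Touvron et al., 2023:

```python
# Rough token budget implied by the retrofitting percentages quoted above,
# assuming Llama 2 was pre-trained on ~2 trillion tokens (Touvron et al., 2023).
pretrain_tokens = 2e12
for ratio, setting in [(0.02, "2x compression"), (0.08, "8x compression")]:
    print(f"{setting}: ~{ratio * pretrain_tokens / 1e9:.0f}B tokens")
# 2x compression: ~40B tokens
# 8x compression: ~160B tokens
```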