vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

PowerInfer: using a combination of CPU and GPU for faster inference #2212

Open nivibilla opened 8 months ago

nivibilla commented 8 months ago

Splitting hot and cold neurons across the CPU and GPU allows faster inference when using larger models or higher quantisations. The demo shows an 11x speedup over llama.cpp when running a 40B model on a single 24 GB GPU.

Demo https://twitter.com/omarsar0/status/1737168751668187229?t=blU8xZMb7JMJTtAHra7zvQ&s=19

GitHub https://github.com/SJTU-IPADS/PowerInfer

Wondering if this is something that could also be integrated into vLLM.
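
For anyone skimming, here is a rough PyTorch sketch of the idea as I understand it. The class and parameter names are made up for illustration, this is not PowerInfer's or vLLM's actual code, and it assumes a CUDA device is available: the FFN rows that fire most often ("hot" neurons) are kept resident on the GPU, while the rarely firing rest ("cold" neurons) stay in CPU memory and are evaluated there.

```python
import torch
import torch.nn as nn

class HotColdFFN(nn.Module):
    """Toy ReLU FFN whose hot neurons live on the GPU and cold neurons on the CPU."""

    def __init__(self, hidden: int, intermediate: int, hot_idx: torch.Tensor):
        super().__init__()
        cold_mask = torch.ones(intermediate, dtype=torch.bool)
        cold_mask[hot_idx] = False
        cold_idx = cold_mask.nonzero(as_tuple=True)[0]
        up = torch.randn(intermediate, hidden) / hidden ** 0.5
        down = torch.randn(hidden, intermediate) / intermediate ** 0.5
        # Hot rows: frequently activated, kept resident on the GPU.
        self.up_hot = nn.Parameter(up[hot_idx].cuda())
        self.down_hot = nn.Parameter(down[:, hot_idx].cuda())
        # Cold rows: rarely activated, left in CPU memory.
        self.up_cold = nn.Parameter(up[cold_idx])
        self.down_cold = nn.Parameter(down[:, cold_idx])

    def forward(self, x_gpu: torch.Tensor) -> torch.Tensor:
        # Dense GPU matmul over the hot neurons (covers most activations).
        hot = torch.relu(x_gpu @ self.up_hot.T) @ self.down_hot.T
        # Cold neurons are evaluated on the CPU; ReLU zeroes most of them anyway.
        x_cpu = x_gpu.float().cpu()
        cold = torch.relu(x_cpu @ self.up_cold.T) @ self.down_cold.T
        return hot + cold.to(device=x_gpu.device, dtype=x_gpu.dtype)

# Usage: pretend the first 25% of neurons are "hot".
layer = HotColdFFN(hidden=512, intermediate=2048, hot_idx=torch.arange(512))
y = layer(torch.randn(4, 512, device="cuda"))
```

As I understand it, PowerInfer also pairs this with small per-layer activation predictors so that most cold neurons can be skipped entirely rather than computed on the CPU, which is where the big speedup comes from.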

i-amgeek commented 8 months ago

It is designed mainly to improve the speed of sparse LLMs; it won't allow faster inference with dense LLMs.

nivibilla commented 8 months ago

They show sparsity even in dense models like Falcon. But I guess Mixtral MoE is a better candidate.

libratiger commented 8 months ago

> It is designed mainly to improve the speed of sparse LLMs; it won't allow faster inference with dense LLMs.

But there are still many LLMs that use the ReLU activation function, so could this still have a chance?
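
For a concrete sense of what "activation sparsity" means here, a tiny PyTorch sketch with toy random weights (not a real checkpoint; with an actual model you would hook the MLP activations instead):

```python
import torch

# Toy single-layer check: what fraction of FFN neurons are zeroed out by ReLU?
hidden, intermediate, batch = 4096, 11008, 8
w_up = torch.randn(intermediate, hidden) / hidden ** 0.5
x = torch.randn(batch, hidden)

acts = torch.relu(x @ w_up.T)                     # (batch, intermediate)
sparsity = (acts == 0).float().mean().item()
print(f"inactive neurons after ReLU: {sparsity:.1%}")
```

On random weights this only gives roughly 50%; the PowerInfer paper reports much higher activation sparsity for trained ReLU models, which is what makes skipping the cold neurons worthwhile.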