lyogavin / airllm

AirLLM 70B inference with single 4GB GPU
Apache License 2.0

Is AirLLM faster than llama.cpp? #206

Open Lizonghang opened 12 hours ago

Lizonghang commented 12 hours ago

Dear Lyogavin,

Thanks for your wonderful work. I have a question: does AirLLM run faster than llama.cpp? Do you have any benchmark data on that?

As far as I know, llama.cpp uses mmap to manage memory. When the computation hits a page fault, the OS automatically loads the needed tensor weights from disk into memory and computation continues; it also evicts less-used pages when memory pressure is high, all managed by the OS. So llama.cpp can also run very large LLMs, similar to the capability AirLLM provides.
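To make sure I describe it correctly, here is a minimal sketch of what I mean by the mmap approach (illustrative only, not llama.cpp's actual code; the file name, shapes, and layer count are placeholders):

```python
import numpy as np

NUM_LAYERS, DIM = 4, 1024
LAYER_ELEMS = DIM * DIM

# Create a dummy weight file so the example is self-contained.
np.random.rand(NUM_LAYERS * LAYER_ELEMS).astype(np.float16).tofile("weights.bin")

# Map the file; no weights are read yet -- the OS pages data in on first touch.
weights = np.memmap("weights.bin", dtype=np.float16, mode="r")

hidden = np.random.rand(1, DIM).astype(np.float16)
for i in range(NUM_LAYERS):
    # Touching this slice triggers page faults; the kernel loads the needed
    # pages from disk and can later evict cold pages under memory pressure.
    w = weights[i * LAYER_ELEMS:(i + 1) * LAYER_ELEMS].reshape(DIM, DIM)
    hidden = hidden @ w
```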

I noticed that AirLLM uses prefetching to overlap disk I/O latency with computation. Will this be faster than llama.cpp (with mmap enabled)? And how large is the improvement? A rough sketch of what I mean by prefetching follows below.
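This is just my hypothetical illustration, not AirLLM's actual implementation: while layer i is computing, a background thread already loads layer i+1 from disk, so disk I/O and compute overlap instead of being serialized by page faults.

```python
import threading
from queue import Queue

import numpy as np

NUM_LAYERS, DIM = 4, 1024
LAYER_ELEMS = DIM * DIM

# Dummy weight file so the example is self-contained.
np.random.rand(NUM_LAYERS * LAYER_ELEMS).astype(np.float16).tofile("weights.bin")

def load_layer(i):
    # Explicit read of one layer's weights from disk (offset is in bytes,
    # float16 = 2 bytes per element).
    return np.fromfile("weights.bin", dtype=np.float16, count=LAYER_ELEMS,
                       offset=i * LAYER_ELEMS * 2).reshape(DIM, DIM)

# Hold at most one prefetched layer, so memory use stays bounded.
prefetched = Queue(maxsize=1)

def prefetcher():
    for i in range(NUM_LAYERS):
        prefetched.put(load_layer(i))  # runs ahead of the compute loop

threading.Thread(target=prefetcher, daemon=True).start()

hidden = np.random.rand(1, DIM).astype(np.float16)
for i in range(NUM_LAYERS):
    w = prefetched.get()  # usually already loaded while the previous layer computed
    hidden = hidden @ w
```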