Dear Lyogavin,
Thanks for your wonderful work. I have a question: does AirLLM run faster than llama.cpp? Do you have any data on that?
As far as I know, llama.cpp uses mmap to manage memory. When the computation hits a page fault, the OS automatically loads the tensor weights from disk into memory and computation continues; it also evicts less-used pages when memory pressure is high, all managed by the OS. So llama.cpp also supports very large LLMs, similar to the feature AirLLM provides.
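For context, here is a minimal sketch of how I understand the mmap approach, just to make sure we are talking about the same thing (the file name and tensor layout below are placeholders, not llama.cpp's actual format):

```python
import mmap
import numpy as np

# Map a weight file into virtual memory. The OS pages tensor data in from
# disk on first access (a page fault) and can evict cold pages under memory
# pressure -- no explicit load/unload calls in user code.
with open("model_weights.bin", "rb") as f:  # placeholder file name
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# View a slice of the mapping as a tensor without copying it up front;
# the actual disk read happens lazily when the bytes are first touched.
layer_offset = 0                       # hypothetical byte offset of one layer
layer_nbytes = 4096 * 4096 * 2         # hypothetical fp16 weight matrix
weights = np.frombuffer(mm, dtype=np.float16,
                        count=layer_nbytes // 2, offset=layer_offset)
```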
I noticed that AirLLM uses prefetching to overlap disk I/O latency with computation. Will this be faster than llama.cpp (with mmap enabled), and if so, how much is the improvement?
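To be clear about what I mean by prefetching, here is a rough sketch of the pattern I have in mind (not AirLLM's actual code; `load_layer_weights` and `run_layer` are hypothetical stand-ins for disk I/O and compute):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def load_layer_weights(i):
    """Hypothetical stand-in for reading layer i's tensors from disk."""
    time.sleep(0.01)          # simulate disk latency
    return f"weights_{i}"     # placeholder weight object

def run_layer(i, weights, hidden):
    """Hypothetical stand-in for layer i's forward pass."""
    time.sleep(0.01)          # simulate compute time
    return hidden

def forward(hidden, num_layers):
    # Overlap disk I/O with compute: while layer i is running, layer i+1
    # is already being read from disk on a background thread.
    with ThreadPoolExecutor(max_workers=1) as io:
        future = io.submit(load_layer_weights, 0)
        for i in range(num_layers):
            weights = future.result()                          # wait for prefetched weights
            if i + 1 < num_layers:
                future = io.submit(load_layer_weights, i + 1)  # start loading next layer
            hidden = run_layer(i, weights, hidden)             # compute while next layer loads
    return hidden

print(forward(hidden=[0.0], num_layers=4))
```

My intuition is that explicit prefetching hides the load latency proactively, whereas mmap only starts the disk read at the moment of the page fault, so I would expect a difference, but I have no numbers to back that up.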