mlc-ai / web-llm

High-performance In-browser LLM Inference Engine
https://webllm.mlc.ai
Apache License 2.0

Run llama.cpp models #38

Closed · MariasStory closed this issue 4 months ago

MariasStory commented 1 year ago

I guess it would be easy for you to run ggml (llama.cpp-compatible) models. In that case, you wouldn't need the GPU and could run the models in memory. From a simple test I find that llama.cpp on a 13B model is faster than web-llm on a 7B model on the same system. Although running on the GPU might help: https://github.com/ggerganov/llama.cpp/discussions/915

jinhongyii commented 1 year ago

Really interesting phenomenon. Could you tell us which GPU you are using and your memory usage when running llama.cpp? From the memory usage I can infer which compression strategy llama.cpp is using, and that largely affects the performance.

MariasStory commented 1 year ago

Hi. For running web-llm in the browser I've tried an NVIDIA T2000 and the built-in Intel GPU of an i9 system. The T2000 is fastest at prompt ingestion (~17 tokens/s) but slowest at generation (~0.6 tokens/s). The built-in Intel GPU runs at ~2 tokens/s for both tasks.
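
(For anyone who wants to reproduce such numbers, here is a minimal sketch using today's @mlc-ai/web-llm API. CreateMLCEngine, runtimeStatsText, and the model id are assumptions based on the current package and may differ from the ChatModule API that existed when this thread was opened.)

```typescript
import { CreateMLCEngine } from "@mlc-ai/web-llm";

// Assumed model id from the current prebuilt list; check https://webllm.mlc.ai for exact names.
const MODEL_ID = "Llama-3.1-8B-Instruct-q4f32_1-MLC";

async function benchmarkWebLLM(): Promise<void> {
  // Downloads the weights, compiles the WebGPU kernels, and initializes the runtime.
  const engine = await CreateMLCEngine(MODEL_ID, {
    initProgressCallback: (report) => console.log(report.text),
  });

  // One round of chat so that both prefill (prompt ingestion) and decode (generation) run.
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: "Explain mmap in one short paragraph." }],
  });
  console.log(reply.choices[0].message.content);

  // Assumed to still be exposed: a text summary that includes prefill and decode tokens/s.
  console.log(await engine.runtimeStatsText());
}

benchmarkWebLLM();
```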

llama.cpp runs on the CPU only and uses speedups like mmap to map the model into memory. Still, for the 13B model the memory use is ~7 GB. On the i9 system with 32 GB of memory, prompt ingestion is slower than on the T2000, but generation is much faster than on both GPUs (~4 tokens/s).
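
(Side note: the ~7 GB figure already points at 4-bit quantization, since the weights of a 13B model at 4 bits per weight come to roughly 6.5 GB. A back-of-the-envelope sketch; approxWeightsGB is a purely illustrative helper that belongs to neither project.)

```typescript
// Illustrative only: approximate weight storage of a quantized model, ignoring activations and overhead.
function approxWeightsGB(numParams: number, bitsPerWeight: number): number {
  return (numParams * bitsPerWeight) / 8 / 1e9;
}

console.log(approxWeightsGB(13e9, 4).toFixed(1)); // "6.5"  -> consistent with the ~7 GB observed
console.log(approxWeightsGB(13e9, 8).toFixed(1)); // "13.0" -> 8-bit would not match the reported usage
console.log(approxWeightsGB(7e9, 4).toFixed(1));  // "3.5"  -> rough footprint of a 4-bit 7B model
```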

jinhongyii commented 1 year ago

Got it. We are currently bringing maximum performance to the MacBook GPU backend, and only guarantee runnability on other backends (including the NVIDIA GPU you are running with). Performance on other backends is an item on our to-do list.

MariasStory commented 1 year ago

Just out of curiosity, what do you mean by "bringing maximum performance to the MacBook GPU backend"? Is it a limitation of TVM, or are you using some GPU-specific tricks? Anyway, there are two parts to my request:

  1. The first is about CPU execution and the possibility of running ggml models in the browser, similar to the way llama.cpp runs them.
  2. The second is the possibility of running ggml models on the GPU in the browser, similar to the topic of this discussion: https://github.com/ggerganov/llama.cpp/discussions/915

jinhongyii commented 1 year ago

For "bringing maximum performance on macbook gpu", you need to know that the same code implementation could have dramatically different performance on different hardwares. So it's impossible to run equally well on all devices. Our choice is to bring best performance for macbook gpu first, and then improve performance on other devices.

For the rest of your question, the answer is sadly "no". The reason is that ggml doesn't have good enough code generation for WebGPU; even Triton can only generate code for CUDA. TVM is designed for deployment on diverse backends, and that's why we chose it as our framework.

But what we can promise is this: we will optimize as much as we can so that we match or outperform llama.cpp on as many devices as possible.