ngxson / wllama

WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

performance expectations #4

Open chadkirby opened 7 months ago

chadkirby commented 7 months ago

First, thanks for putting this project together!

I modified examples/basic/index.html to use a more capable model: https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf, which is 1.5 GB.
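(For anyone who wants to reproduce this, here is a minimal sketch of what that modification looks like, assuming the current @wllama/wllama ESM API; the CONFIG_PATHS entries vary by wllama version, and the prompt and sampling settings are placeholders:)

```js
import { Wllama } from '@wllama/wllama';

// Paths to the wasm binaries shipped with wllama; the exact set of keys
// depends on the wllama version, so adjust these to your setup.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS);

// Download and load the 1.5 GB gemma-2b Q4_K_M quant instead of the default model.
await wllama.loadModelFromUrl(
  'https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf'
);

const output = await wllama.createCompletion('Tell me a joke.', {
  nPredict: 64,
  sampling: { temp: 0.7, top_p: 0.9 },
});
console.log(output);
```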

Using LM Studio on my laptop (with GPU Acceleration disabled), I get roughly 25 tokens per second from gemma-2b-it-q4_k_m.gguf.

Running examples/basic/index.html in Chrome 124 on my laptop, I get roughly 6-7 tokens per second from gemma-2b-it-q4_k_m.gguf. (Similar performance in Edge 123.)

Generally, the wasm bindings seem roughly 3-4x slower than native. Is that more or less expected? Are there any wllama knobs I can twiddle to improve performance?
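(For reference, a rough way to measure tokens per second in the browser; this sketch assumes createCompletion exposes an onNewToken callback, as the wllama docs describe, and that a model is already loaded as in the snippet above:)

```js
// Rough tokens-per-second measurement around a wllama completion.
let tokenCount = 0;
const t0 = performance.now();

await wllama.createCompletion('Write a haiku about the ocean.', {
  nPredict: 128,
  onNewToken: () => {
    tokenCount++;
  },
});

const seconds = (performance.now() - t0) / 1000;
console.log(`${(tokenCount / seconds).toFixed(1)} tokens/s over ${tokenCount} tokens`);
```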

ngxson commented 7 months ago

It is expected, since WebAssembly SIMD only supports the equivalent of AVX instructions, not AVX2. This is likely the biggest impact on performance at the moment.
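(A side note: whether the browser has wasm SIMD enabled at all can be checked with the wasm-feature-detect package; this is a generic probe, not part of wllama's own API:)

```js
import { simd, threads } from 'wasm-feature-detect';

// Each check compiles a tiny probe module and resolves to a boolean.
console.log('wasm SIMD:', await simd());
console.log('wasm threads:', await threads());
```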

Another issue is that we're using emscripten's non-native exception handler, which maintains support for older browsers but comes with a small performance cost. We may move to the native exception handler in the future.

Edit: it seems most mainstream browser versions already support native wasm exceptions (see here), so it's safe to enable. Support will be added in the next build of wllama.

ngxson commented 7 months ago

v1.6.0 now uses the native exception handler via -fwasm-exceptions. Here is the browser support matrix: https://webassembly.org/features/
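(If v1.6.0+ fails to load on an older browser, one way to check is the same wasm-feature-detect package, which has a probe for native exception handling; again a sketch outside wllama's own API:)

```js
import { exceptions } from 'wasm-feature-detect';

if (await exceptions()) {
  console.log('Native wasm exception handling is supported; v1.6.0+ builds should load.');
} else {
  console.warn('No native wasm exceptions; an older wllama build may be needed.');
}
```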

iSuslov commented 6 months ago

Hey @chadkirby, out of curiosity, have you tried the latest version with the native exception handler?

chadkirby commented 6 months ago

> Hey @chadkirby, out of curiosity, have you tried the latest version with the native exception handler?

I did. IIRC, I saw a modest performance improvement, but wasm speed was still roughly 3x slower than native.

felladrin commented 6 months ago

One important consideration is that certain browsers, such as Brave, may alter the value of navigator.hardwareConcurrency to prevent fingerprinting.

As a result, it is possible that the browser was utilizing only 2 threads, leading to slow inference.
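(A sketch of how to check the reported value and override the thread count when loading, assuming wllama's load config accepts an n_threads option, as its type definitions suggest; modelUrl is a placeholder for the GGUF URL you want to load:)

```js
// navigator.hardwareConcurrency may be clamped (e.g. to 2) by privacy-focused
// browsers such as Brave, so report it and set the thread count explicitly.
console.log('reported cores:', navigator.hardwareConcurrency);

await wllama.loadModelFromUrl(modelUrl, {
  n_threads: 8, // force 8 threads instead of trusting the reported value
});
```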

Using 8 threads has resulted in satisfactory performance for the Phi-3 model:

https://github.com/ngxson/wllama/assets/418083/a33b93f9-31e6-4acc-9636-620058d60767
