Open chadkirby opened 7 months ago
This is expected, since WebAssembly SIMD only supports the equivalent of AVX instructions, not AVX2. That is probably the biggest hit to performance at the moment.
Another issue is that we're using Emscripten's non-native exception handler, which maintains support for older browsers but comes with a small performance cost. We may move to the native exception handler in the future.
Edit: it seems most mainstream browser versions already support native wasm exceptions (see here), so it's safe to enable them. Support will be added in the next build of wllama.
v1.6.0 now uses the native exception handler via -fwasm-exceptions. Here is the browser support matrix: https://webassembly.org/features/
Hey @chadkirby, out of curiosity, have you tried the latest version with the native exception handler?
I did. IIRC, I saw a modest performance improvement, but wasm speed was still roughly 3x slower than native.
One important consideration is that certain browsers, such as Brave, may alter the value of navigator.hardwareConcurrency to prevent fingerprinting.
As a result, the browser may have been using only 2 threads, leading to slow inference.
Using 8 threads has resulted in satisfactory performance for the Phi-3 model:
https://github.com/ngxson/wllama/assets/418083/a33b93f9-31e6-4acc-9636-620058d60767
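To make the thread-count point concrete, here is a minimal sketch of choosing a thread count without trusting navigator.hardwareConcurrency blindly. The helper name pickThreadCount, the explicit override parameter, and the cap of 16 are all hypothetical and not part of wllama. Passing the result to the model loader through its thread-count option is an assumption; check the wllama docs for the exact option name.

```typescript
// Hypothetical helper, not part of wllama.
// `reported` is what the browser claims (it may be reduced by anti-fingerprinting),
// `override` is a value the user or app sets explicitly, e.g. 8.
function pickThreadCount(
  reported: number | undefined,
  override?: number,
  maxThreads = 16, // arbitrary cap chosen for this sketch
): number {
  // Prefer the explicit override, then the reported value, then a single thread.
  const candidate = override ?? reported ?? 1;
  if (!Number.isFinite(candidate) || candidate < 1) return 1;
  return Math.min(Math.floor(candidate), maxThreads);
}

// In a browser, read the reported value defensively; it is undefined elsewhere.
const reportedCores: number | undefined =
  (globalThis as any).navigator?.hardwareConcurrency;

// Example: the browser may report 2, but we know the machine can run 8 threads.
const threads = pickThreadCount(reportedCores, 8);
console.log(`using ${threads} threads`);
```

Exposing the override as a user-facing setting lets people on privacy-focused browsers raise the thread count themselves.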
First, thanks for putting this project together!

I modified examples/basic/index.html to use a more capable model: https://huggingface.co/lmstudio-ai/gemma-2b-it-GGUF/resolve/main/gemma-2b-it-q4_k_m.gguf, which is 1.5 GB.

Using LM Studio on my laptop (with GPU acceleration disabled), I get roughly 25 tokens per second from gemma-2b-it-q4_k_m.gguf.

Running examples/basic/index.html in Chrome 124 on my laptop, I get roughly 6-7 tokens per second from gemma-2b-it-q4_k_m.gguf. (Similar performance in Edge 123.)

Generally, the wasm bindings seem roughly 3-4x slower than native. Is that more or less expected? Are there any wllama knobs I can twiddle to improve performance?
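For anyone who wants to reproduce the comparison, the arithmetic can be sketched as below. With the figures quoted above (25 tokens per second native, 6-7 in the browser), the slowdown works out to roughly 3.6x to 4.2x. The function names are hypothetical, and how you obtain the token count and elapsed time depends on your own setup.

```typescript
// Hypothetical helpers for comparing throughput; not part of wllama or LM Studio.

// Throughput from a token count and a wall-clock duration in milliseconds.
function tokensPerSecond(tokenCount: number, elapsedMs: number): number {
  if (elapsedMs <= 0) throw new RangeError("elapsedMs must be positive");
  return tokenCount / (elapsedMs / 1000);
}

// How many times slower the wasm run is compared to the native run.
function slowdown(nativeTps: number, wasmTps: number): number {
  if (wasmTps <= 0) throw new RangeError("wasmTps must be positive");
  return nativeTps / wasmTps;
}

// Example with the numbers reported in this issue.
const low = slowdown(25, 7); // best case for wasm
const high = slowdown(25, 6); // worst case for wasm
console.log(`wasm is ${low.toFixed(1)}x to ${high.toFixed(1)}x slower than native`);
```

Measuring both runs with the same prompt and the same number of generated tokens keeps the comparison fair.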