ngxson / wllama

WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

Out Of Memory error in Wllama with multi-threads on iOS browser #18

Closed · felladrin closed this issue 6 months ago

felladrin commented 6 months ago

I'm experiencing an Out Of Memory error when attempting to run Wllama with multi-threads on an iOS browser.

It occurs regardless of the model size, even though navigator.hardwareConcurrency reports only 3 on this browser.

For instance, I can run TinyLlama 1.1B (Q3_K) with single-thread, but even a Llama 68M model fails when I enable multi-thread.

So this problem appears to be related to Wasm or the worker script rather than the models themselves.

To work around it, I'm using a try-catch: when loading throws an Out Of Memory error, I reinitialize Wllama with { "n_threads": 1 }.
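
Roughly, the fallback looks like this (the model URL is a placeholder, and CONFIG_PATHS stands for the wasm/worker asset paths from the README):

```js
import { Wllama } from '@wllama/wllama';
import { CONFIG_PATHS } from './config'; // placeholder: asset paths as described in the README

const MODEL_URL = 'https://example.com/tinyllama.gguf'; // placeholder model URL

let wllama = new Wllama(CONFIG_PATHS);
try {
  // First attempt: default (multi-threaded) loading.
  await wllama.loadModelFromUrl(MODEL_URL);
} catch (err) {
  // On this iOS browser, loading throws Out Of Memory here;
  // reinitialize and retry with a single thread.
  wllama = new Wllama(CONFIG_PATHS);
  await wllama.loadModelFromUrl(MODEL_URL, { n_threads: 1 });
}
```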

Is anyone else facing this issue?

ngxson commented 6 months ago

I suspect the OOM error may be resolved if we split the model into smaller chunks before loading it. There's an updated section in the README mentioning this:

(screenshot of the README section on splitting models into smaller chunks)

However, I'm not 100% sure if this resolves the issue. Would you mind testing it out? Thank you.

There's a test model in the advanced example:

https://github.com/ngxson/wllama/blob/65bbcc07c52d9105936bcd783e3e718ca93d1c5f/examples/advanced/index.html#L39-L45
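
In case it's useful, loading a split model would look roughly like this (the chunk URLs are placeholders; the linked example has a real list, and the chunks themselves are produced offline with llama.cpp's gguf-split tool):

```js
// Placeholder URLs: a model pre-split into three GGUF chunks.
const chunks = [
  'https://example.com/model-00001-of-00003.gguf',
  'https://example.com/model-00002-of-00003.gguf',
  'https://example.com/model-00003-of-00003.gguf',
];

const wllama = new Wllama(CONFIG_PATHS);
// Passing the list of chunks lets wllama download them one by one,
// which should avoid allocating one huge buffer for the whole model.
await wllama.loadModelFromUrl(chunks);
```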

flatsiedatsie commented 6 months ago

// Wrong thread

felladrin commented 6 months ago

Hey, folks, I have good news!

After reading a related issue, I decided to tweak the max wasm memory and confirmed it was the root of the problem.

So I decreased the `maximum` property in the `WebAssembly.Memory` instantiation:

```diff
-wasmMemory=new WebAssembly.Memory({"initial":INITIAL_MEMORY/65536,"maximum":4294967296/65536,"shared":true})
+wasmMemory=new WebAssembly.Memory({"initial":INITIAL_MEMORY/65536,"maximum":1288490189/65536,"shared":true})
```

The maximum was decreased from 4 GB to ~1.2 GB (20% of the device's 6 GB of RAM). Since then I haven't hit the out-of-memory problem when running multi-threaded Wllama 🎉. TinyLlama 1.1B runs fast in the mobile browser!
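
To make the numbers concrete: WebAssembly memory is sized in 64 KiB pages, so the patch above just lowers the page cap. A standalone sketch (the INITIAL_MEMORY value here is made up for illustration):

```js
const WASM_PAGE_SIZE = 65536; // WebAssembly.Memory is sized in 64 KiB pages
const INITIAL_MEMORY = 128 * 1024 * 1024; // illustrative initial heap (128 MiB)
const MAX_BYTES = 1288490189; // ~1.2 GB = 20% of this device's 6 GB of RAM

const wasmMemory = new WebAssembly.Memory({
  initial: INITIAL_MEMORY / WASM_PAGE_SIZE, // 2048 pages
  maximum: Math.floor(MAX_BYTES / WASM_PAGE_SIZE), // 19660 pages ≈ 1.2 GB
  shared: true, // shared memory is required by the multi-threaded build
});
```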

So we need to find a way to make this Emscripten MAXIMUM_MEMORY setting configurable (it's currently hard-coded to 4 GB). Maybe we can do something like what we discussed here.

ngxson commented 6 months ago

@felladrin Thanks for the info. Yeah, it would be quite annoying to detect whether we're running on iOS and then set the appropriate max memory.

Another idea would be to write a loop that tries multiple values until one succeeds.
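
Something like this untested sketch (the candidate values are illustrative):

```js
// Try progressively smaller maximums until the shared memory reservation succeeds.
// Candidates are in 64 KiB pages: 4 GiB, 2 GiB, ~1.2 GiB, 1 GiB.
function allocateWasmMemory(initialPages) {
  const candidates = [65536, 32768, 19660, 16384];
  for (const maximum of candidates) {
    try {
      return new WebAssembly.Memory({ initial: initialPages, maximum, shared: true });
    } catch (e) {
      // iOS Safari throws when it cannot reserve this much memory; try a smaller cap.
    }
  }
  throw new Error('Could not allocate shared WebAssembly memory');
}
```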