ngxson / wllama

WebAssembly binding for llama.cpp - Enabling on-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

After upgrading to version 1.8.0, the async function `loadModelFromUrl` is not completing when using large models #31

Closed: felladrin closed this issue 1 month ago

felladrin commented 6 months ago

Something interesting occurred while upgrading to version 1.8.0. Previously, it had been throwing an "Out of Memory" error, but that issue has now been resolved. However, a new problem has surfaced: the async function `loadModelFromUrl` does not complete. It appears to be stuck in a state where it neither resolves nor rejects. It's possible that the error is caught somewhere in the middle of the process and never passed up.

This issue can be reproduced with models that are too large to fit into the device's memory. It works perfectly fine with smaller models.

It's possible that this problem is related to the changes made in this pull request:

However, as I only encountered this issue on the iOS browser, it's also possible that it's related to this change:

If anyone would like to test this problem, you can use this 10-part split GGUF of TinyLlama on a device with less than 6GB of RAM: https://huggingface.co/Felladrin/gguf-sharded-TinyLlama-1.1B-1T-OpenOrca/resolve/main/tinyllama-1.1b-1t-openorca.Q3_K_S.shard-00001-of-00010.gguf. (If an even larger model is needed, there are also Q4_K_M and Q8_0 versions available in this repository.)
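For reference, a minimal load sketch in TypeScript (the wasm asset paths below are illustrative and depend on the wllama version and on how your bundler serves the package files):

```ts
import { Wllama } from '@wllama/wllama';

// Illustrative asset paths; the exact keys and locations depend on the
// wllama version and on how the package's wasm files are served.
const CONFIG_PATHS = {
  'single-thread/wllama.wasm': '/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/wllama/multi-thread/wllama.wasm',
};

const wllama = new Wllama(CONFIG_PATHS);

// On a device with enough memory this resolves normally; on a <6GB iOS
// device it neither resolves nor rejects (the behavior described above).
await wllama.loadModelFromUrl(
  'https://huggingface.co/Felladrin/gguf-sharded-TinyLlama-1.1B-1T-OpenOrca/resolve/main/tinyllama-1.1b-1t-openorca.Q3_K_S.shard-00001-of-00010.gguf'
);
```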

ngxson commented 6 months ago

Probably because the out-of-memory error is now thrown internally by the cpp code (and not by the worker js code). Can you confirm whether you see an error from `llama_new_context_with_model`? (ref. https://github.com/ngxson/wllama/issues/12#issuecomment-2108250211)

flatsiedatsie commented 6 months ago

Sounds like the same issue I came across here?

With version 1.8, Wllama doesn't seem to raise an error though? It just states the issue in the console. But my code thinks the model has loaded OK, even though it hasn't. Is there a way to get the failed state?

// Doh, you already figured that out :-)
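Until the failure surfaces as a rejection, one way for an app to avoid an indefinitely pending promise is to race the load against a timeout. A minimal sketch (the cutoff value is arbitrary, and a timeout of course can't distinguish a hang from a genuinely slow download):

```ts
import { Wllama } from '@wllama/wllama';

// Treat a load that never settles as a failure after `timeoutMs`.
async function loadModelWithTimeout(
  wllama: Wllama,
  modelUrl: string,
  timeoutMs = 120_000
): Promise<void> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Model load timed out after ${timeoutMs} ms`)),
      timeoutMs
    );
  });
  try {
    await Promise.race([wllama.loadModelFromUrl(modelUrl), timeout]);
  } finally {
    clearTimeout(timer);
  }
}
```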

felladrin commented 6 months ago

Thanks for the reference. There is a lot of good info in that thread!

I've just noticed a pattern regarding this issue:

The `loadModelFromUrl` function only hangs when running multi-threaded. It doesn't even print the warnings to the console. When I connect the phone to Safari DevTools, I see the following:

[Screenshot: Safari DevTools console output from the hanging load]

From the screenshot, we can see that the device was using `n_threads == 2`.

When I force it to use `n_threads = 1` with the same model, it prints the warnings and also triggers the error, allowing me to catch it with a try/catch.

This indicates that `loadModelFromUrl` only fails to complete when a too-large model is loaded with multi-threading.
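For anyone who needs the catchable-error behavior in the meantime, a sketch of the single-thread workaround described above (assuming the load config exposes an `n_threads` option; check this against your wllama version):

```ts
import { Wllama } from '@wllama/wllama';

// Force single-threaded loading so a too-large model rejects with a
// catchable error instead of hanging. The `n_threads` option here is
// based on the behavior described above; verify it in your version.
async function loadSingleThreaded(wllama: Wllama, modelUrl: string) {
  try {
    await wllama.loadModelFromUrl(modelUrl, { n_threads: 1 });
    console.log('Model loaded.');
  } catch (err) {
    console.error('Model failed to load:', err);
  }
}
```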

PS: I haven't tested your changes from https://github.com/ngxson/wllama/pull/34.

felladrin commented 6 months ago

ℹ️ This issue (`loadModelFromUrl` hanging when a too-large model is loaded with multi-threading) is still present in v1.9.0. I tried adjusting the `stepBytes` and `maxBytes` from `getWasmMemory()` to see if any combination could resolve the issue, but unfortunately I couldn't find one. I've run out of ideas. Since it runs fine with small models, I've decided not to use large models (> 1 billion parameters) on mobile anymore.

Note: iOS browsers don't clear the memory of web workers properly when the page is reloaded. For instance, if the page is reloaded before `wllama.exit()` is called, `wllama.loadModelFromUrl()` will then run with even less memory than usual. So this hang was more evident after reloading the page and re-running the inference. I found these related issues that, unfortunately, don't have a solution:

ngxson commented 6 months ago

@felladrin Sorry for the late response. Yeah, it seems like there are a lot of problems with Safari on iOS.

> This issue (`loadModelFromUrl` hanging when a too-large model is loaded with multi-threading) is still present in v1.9.0.

Do you get the same error as last time (i.e. `Aborted()`)?

> iOS browsers don't clear the memory of web workers properly when the page is reloaded.

Probably we can make the web worker exit itself when the page reloads. But I'm still hesitant to do this, since it should be the browser's responsibility. I'll have a look at this when I have more time.
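In the meantime, the app itself can attempt a best-effort teardown before the page goes away. A sketch (assuming `wllama.exit()` returns a promise, and using `pagehide`, which iOS Safari fires more reliably than `beforeunload`; whether iOS actually reclaims the worker memory is not guaranteed):

```ts
// Best-effort cleanup so a reload doesn't inherit a leaked worker heap.
// `wllama` is the live instance created elsewhere in the app; exit() is
// fire-and-forget here because unload handlers can't await promises.
window.addEventListener('pagehide', () => {
  wllama.exit().catch(() => {});
});
```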

felladrin commented 6 months ago

Ah, no worries @ngxson! My intention was just to document it, so other devs facing this issue can get some clues. But I'm not waiting for it to be fixed, as it works pretty well with models under 500M params.

Not sure when I'll try larger models on iOS again, but if I find anything new, I'll share it here!

felladrin commented 1 month ago

After the launch of iOS 18, most of those out-of-memory issues seem to be gone! 🎉

I noticed that they (Apple) now force Safari to hard-reload the page when it detects that memory is running too low. After the reload, with more memory available, the models usually run fine. Wllama can easily run 1B models (e.g. Llama 3.2 1B Q4_K_M) on an iPhone with less than 6GB of memory.

flatsiedatsie commented 1 month ago

Even the next iPhone SE is rumored to have 8GB of memory, so Apple is quickly making 8GB the new baseline. (The latest iPhone also comes with at least 8GB.)