ngxson / wllama

WebAssembly binding for llama.cpp - Enabling in-browser LLM inference
https://huggingface.co/spaces/ngxson/wllama
MIT License

Introduce heapfs #39

Closed · ngxson closed this 6 months ago

ngxson commented 6 months ago

Related to #35

Files are now allocated directly inside the WASM heap (instead of being copied into the JS heap, as memfs does).

This should greatly reduce the number of copies, saving RAM.

As a bonus, mmap now maps files directly onto the heap (instead of doing a memcpy).
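
For illustration, a minimal sketch of the idea, assuming an Emscripten-style module that exposes `_malloc` and a `HEAPU8` view (a hypothetical helper, not the actual wllama internals): stream file chunks straight into the WASM heap rather than accumulating a full JS-side copy first.

```ts
// Hypothetical sketch: write a file's bytes directly into the WASM heap.
// Assumes an Emscripten-style module exposing _malloc and a HEAPU8 view;
// this is not the actual wllama implementation.
interface EmscriptenLikeModule {
  _malloc(size: number): number; // returns an offset (pointer) into the WASM heap
  HEAPU8: Uint8Array;            // view over the module's linear memory
}

async function writeBlobIntoHeap(
  module: EmscriptenLikeModule,
  blob: Blob
): Promise<{ ptr: number; size: number }> {
  const size = blob.size;
  const ptr = module._malloc(size);          // reserve space inside the heap up front
  const reader = blob.stream().getReader();
  let offset = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    // Re-read HEAPU8 on every chunk: the view may be replaced if memory grows.
    module.HEAPU8.set(value, ptr + offset);  // copy the chunk straight into the heap
    offset += value.length;
  }
  return { ptr, size };
}
```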

ngxson commented 6 months ago

@felladrin Would you mind testing this out? I haven't tested with models bigger than 2B yet, but hopefully with this PR we can load 7B Q4_K.

felladrin commented 6 months ago

Thanks, @ngxson. I compiled and tested the changes with Q4_0, Q4_K_S, and Q4_K_M, and uploaded the split-gguf files of those 3 quants here.


Results:

Q4_0 and Q4_K_S work! (Using the allenai/OLMo-7B-Instruct model, because Llama/Mistral models have larger GGUF files than OLMo.)

But there's not much space left in the 4 GB of memory reserved for WASM to use for the n_ctx, so the most I could use was n_ctx = 512 for Q4_0 and n_ctx = 256 for Q4_K_S (always using cache_type_k: "q4_0").

Here are screenshots of Q4_K_S:

[screenshots]

Q4_K_M fails because the minimum n_ctx is 256, so it doesn't fit in memory along with the model, which is reported to be 3.90 GiB:

[screenshots]

So I believe this change is an improvement and should be merged, but we still can't use Q4_K_M, because most 7B models have a Q4_K_M that is too large.
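
For reference, a minimal sketch of the kind of load call used for these tests (the config-path constant and the model URL are placeholders; n_ctx and cache_type_k are the options mentioned above, and the exact wllama option surface should be checked against the README):

```ts
import { Wllama } from '@wllama/wllama';

// Placeholder: maps wasm asset names to URLs; see the wllama README for the real keys.
declare const CONFIG_PATHS: Record<string, string>;

// Placeholder URL; the actual split-gguf files are linked above.
const modelUrl = 'https://huggingface.co/.../OLMo-7B-Instruct-Q4_K_S-00001-of-0000N.gguf';

const wllama = new Wllama(CONFIG_PATHS);
await wllama.loadModelFromUrl(modelUrl, {
  n_ctx: 256,           // the largest context that still fit next to the Q4_K_S weights
  cache_type_k: 'q4_0', // quantized key cache, as noted above
});
```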


Extra info: this is the failure message I get when the file is too large (Llama/Mistral Q4_K_M); inference doesn't even start:

[screenshot of the error message]
ngxson commented 6 months ago

@felladrin Thanks for the test. Seems like Q4_K_M is still too large, generally speaking. But still, this PR doesn't break anything :smiley:

I'm deploying v1.8.1 with this today

flatsiedatsie commented 6 months ago

Off topic, but out of curiosity: how does WebLLM seemingly work around this? Do they split the model and context into separate WASMs or something? (I have no idea what I'm saying.)

And pragmatically: if 4 GB is a hard limit, then with Phi-3-mini-128k-instruct-Q4_0.gguf, which is 2.18 GB, how much context could theoretically fit in the remaining 1.8 GB?
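
A hedged back-of-envelope way to think about it: the KV cache grows linearly with n_ctx, so dividing the leftover heap by the per-token cache size gives an upper bound. The Phi-3-mini numbers below (32 layers, 32 KV heads, head dim 96, f16 cache) are assumptions for illustration, and llama.cpp's compute buffers eat into the budget too, so the real ceiling is lower:

```ts
// Rough upper bound on n_ctx from the leftover heap (all numbers are assumptions).
const layers = 32;
const kvHeads = 32;
const headDim = 96;
const bytesPerElem = 2; // f16 K and V entries
const bytesPerToken = 2 * layers * kvHeads * headDim * bytesPerElem; // K + V across all layers
const budgetBytes = 1.8 * 1024 ** 3; // ~1.8 GB left after the 2.18 GB model

console.log(Math.floor(budgetBytes / bytesPerToken)); // ≈ 4900 tokens with an f16 KV cache
```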

ngxson commented 6 months ago

@flatsiedatsie I'm not sure, but looking at a WebLLM-compatible model, the model seems to be split into small shards. Maybe they load each shard into the GPU and then delete it from main memory.

Also, I don't have details about the q4f16_1 quantization they're using, so I don't know whether it's comparable to Q4_K_M.

flatsiedatsie commented 6 months ago

Interesting. WebLLM also supports 70B models, which blows my mind. Those chunks are a lot bigger / more varied:

https://huggingface.co/mlc-ai/Llama-3-70B-Instruct-q3f16_1-MLC/tree/main

felladrin commented 6 months ago

Before merging, can you try it on an iOS browser, @ngxson?

I noticed that it throws Out of memory whenever I use multi-threading, independently of the model and the n_ctx (single-threading works fine). I tested with a 160-million-parameter model that works fine on v1.8.0. So maybe it needs a fallback.

[screenshot of the out-of-memory error]
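
A possible shape for that fallback, sketched with a hypothetical `loadModel` / `n_threads` stand-in rather than the confirmed wllama API: try the multi-threaded configuration first and retry single-threaded if it throws.

```ts
// Hypothetical fallback sketch; `loadModel` and `n_threads` are stand-ins,
// not the confirmed wllama API.
async function loadWithFallback(
  loadModel: (opts: { n_threads: number }) => Promise<void>
): Promise<void> {
  try {
    await loadModel({ n_threads: navigator.hardwareConcurrency || 4 });
  } catch (err) {
    console.warn('Multi-threaded load failed, retrying single-threaded:', err);
    await loadModel({ n_threads: 1 });
  }
}
```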
ngxson commented 6 months ago

@felladrin I attempted to fix it in https://github.com/ngxson/wllama/pull/39/commits/b70fa5acb33ddfd87cc844ce39a6f7f5b5fc5301 ; can you give it a try? Thanks.

felladrin commented 6 months ago

Now it's working flawlessly on iOS, @ngxson! 🚀

[screenshots]

I've also re-checked the latest change on desktop with the 7B Q4_K_S, and it's all good!

[screenshots]