@felladrin Would you mind testing this out? I haven't tested with models bigger than 2B yet, but hopefully with this PR we can load 7B Q4_K.
Thanks, @ngxson. I compiled and tested the changes with Q4_0, Q4_K_S, and Q4_K_M, and uploaded the split GGUFs of those three quants here.
Results:
Q4_0 and Q4_K_S work! (Using the allenai/OLMo-7B-Instruct model, because Llama/Mistral models have larger GGUF files than OLMo.)
But there's not much space left in the 4 GB of memory reserved for WASM to use for the `n_ctx`. So the maximum I could use was `n_ctx = 512` for Q4_0 and `n_ctx = 256` for Q4_K_S (always using `cache_type_k: "q4_0"`).
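For reference, a minimal sketch of a load call with those settings. This assumes the `Wllama` constructor takes a map of WASM asset paths and that `loadModelFromUrl` accepts a list of shard URLs plus the `n_ctx` / `cache_type_k` options used above; the asset paths and Hugging Face URLs are placeholders, so check the wllama docs for the exact API.

```ts
import { Wllama } from "@wllama/wllama";

// Map of wllama's WASM assets (paths are placeholders; serve them from wherever
// the package files are hosted in your app).
const wllama = new Wllama({
  "single-thread/wllama.wasm": "/wllama/single-thread/wllama.wasm",
  "multi-thread/wllama.wasm": "/wllama/multi-thread/wllama.wasm",
});

// Load the split GGUF shards with a small context and a quantized K cache, so that
// weights + KV cache stay under the 4 GB WASM address-space limit.
await wllama.loadModelFromUrl(
  [
    "https://huggingface.co/<user>/<repo>/resolve/main/olmo-7b-instruct.Q4_K_S-00001-of-00002.gguf",
    "https://huggingface.co/<user>/<repo>/resolve/main/olmo-7b-instruct.Q4_K_S-00002-of-00002.gguf",
  ],
  {
    n_ctx: 256,            // the max that fit alongside the Q4_K_S weights
    cache_type_k: "q4_0",  // quantized K cache to save memory
  }
);
```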
Here are screenshots of Q4_K_S:
Q4_K_M fails because the minimum `n_ctx` is 256, so it doesn't fit in memory along with the model, which is reported to be 3.90 GiB:
So I believe this change is an improvement and should be merged, but we still cannot use Q4_K_M because most of the 7B models have a Q4_K_M that is too large.
Extra info: this is the failure message I get when the file is too large (Llama/Mistral Q4_K_M); it doesn't even start inference:
@felladrin Thanks for the test. Seems like Q4_K_M is still too large, generally speaking. But still, this PR doesn't break anything :smiley:
I'm deploying v1.8.1 with this today
Off topic, but out of curiosity: how does WebLLM seemingly work around this? Do they split the model and context into separate WASM modules or something? (I have no idea what I'm saying.)
And pragmatically: if 4 GB is a hard limit, then with Phi-3-mini-128k-instruct-Q4_0.gguf, which is 2.18 GB, how much context could theoretically fit in the remaining 1.8 GB?
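For a rough sense of scale, here is a back-of-the-envelope estimate only, assuming the commonly cited Phi-3-mini dimensions (32 transformer layers, 3072-wide K/V per layer) and an unquantized f16 KV cache; the real ceiling depends on the GGUF metadata and on llama.cpp's compute buffers.

```ts
// Back-of-the-envelope only; real usage also needs compute buffers and runtime
// overhead, so the practical limit is lower. Architecture values below are the
// commonly cited Phi-3-mini numbers; check the GGUF metadata for the real ones.
const nLayers = 32;          // transformer blocks
const kvWidth = 3072;        // K (or V) width per layer (MHA: n_heads * head_dim)
const bytesPerElemF16 = 2;

// Per-token KV cache = 2 (K and V) * layers * width * bytes per element
const kvBytesPerToken = 2 * nLayers * kvWidth * bytesPerElemF16; // 393,216 B ≈ 384 KiB

const freeBytes = 1.8 * 1024 ** 3;  // the ~1.8 GB left after the 2.18 GB model
const maxCtx = Math.floor(freeBytes / kvBytesPerToken);
console.log(`${kvBytesPerToken} bytes/token -> ~${maxCtx} tokens of f16 KV cache`);
// -> roughly 4900 tokens in theory, before compute buffers are subtracted
```

Quantizing the K cache to q4_0 (as in the tests above) cuts the K half of that per-token cost by roughly 3.5x, but the V half stays f16 and the compute buffers come out of the same 4 GB, so the practical ceiling is noticeably lower.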
@flatsiedatsie I'm not sure, but looking at a WebLLM-compatible model, the model seems to be split into small shards. Maybe they load each shard into the GPU and then delete it from main memory.
Also, I don't have details about the q4f16_1 quantization they're using, so I don't know whether it's comparable with Q4_K_M.
Interesting. WebLLM also supports 70B models, which blows my mind. Those chunks are a lot bigger / more varied:
https://huggingface.co/mlc-ai/Llama-3-70B-Instruct-q3f16_1-MLC/tree/main
Before merging, can you try it on an iOS browser, @ngxson?
I noticed that it's throwing `Out of memory` whenever I use multi-threading, independently of the model and the `n_ctx`. (Single-threading works fine; tested with a 160-million-parameter model that works fine on v1.8.0.)
So maybe it needs a fallback.
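For illustration, one possible shape for such a fallback, as a sketch only and not the fix in the commit linked below: feature-detect whether WASM threads can work at all and fall back to the single-thread build otherwise.

```ts
// Illustrative feature check: WASM threads require SharedArrayBuffer, which is only
// available in cross-origin-isolated contexts, and some iOS browsers still fail with
// OOM when a large multi-threaded memory is requested.
function canUseMultiThread(): boolean {
  return typeof SharedArrayBuffer !== "undefined" && self.crossOriginIsolated === true;
}

// Heuristic thread count (not wllama's actual logic): cap the pool on constrained
// devices and fall back to the single-thread build when threads are unavailable.
const nThreads = canUseMultiThread()
  ? Math.min(navigator.hardwareConcurrency ?? 1, 4)
  : 1;
```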
@felladrin I attempted to fix it in https://github.com/ngxson/wllama/pull/39/commits/b70fa5acb33ddfd87cc844ce39a6f7f5b5fc5301 . Can you give it a try? Thanks.
Now it's working flawlessly in iOS, @ngxson! 🚀
I've also re-checked the latest change on desktop, with the 7B Q4_K_S, and it's all good!
Related to #35
Files are now allocated directly inside the WASM heap (instead of being copied into the JS heap, like memfs does).
This should greatly reduce the number of copies, saving RAM.
As a bonus, `mmap` now maps the memory directly into the heap (instead of doing a `memcpy`).
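A rough sketch of what the JS side of this idea looks like conceptually; this is not the actual wllama implementation, it assumes the Emscripten module exports `_malloc` and exposes `HEAPU8`, and the helper name `downloadIntoWasmHeap` is made up for illustration.

```ts
// Minimal sketch of the idea: stream a model file's bytes directly into a buffer
// allocated inside the WASM heap, so the whole file never has to sit in a separate
// JS-heap copy the way a memfs-backed filesystem requires.
// Assumes the server reports Content-Length for the file.
async function downloadIntoWasmHeap(
  module: { _malloc(size: number): number; HEAPU8: Uint8Array },
  url: string
): Promise<{ ptr: number; size: number }> {
  const resp = await fetch(url);
  const size = Number(resp.headers.get("Content-Length"));
  const ptr = module._malloc(size);           // space reserved inside the WASM heap

  const reader = resp.body!.getReader();
  let offset = 0;
  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    module.HEAPU8.set(value, ptr + offset);   // write each chunk straight into the heap
    offset += value.length;
  }
  return { ptr, size };
}
```

Since the file bytes then already live in the heap, the `mmap` path on the native side can hand back a pointer into that region instead of copying the weights a second time.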