It seems like the problem comes from llama.cpp. Can you try loading the model with llama.cpp native?
The model crashes after a refresh of the page too, so it wasn't caused by having Mistral loaded first. Hmm.
can you try loading the model with llama.cpp native?
It's been a while since I did that, but I'll try. The model loaded OK with Llama_cpp_wasm.
Here's the .gguf: https://huggingface.co/afrideva/phi-2-meditron-GGUF/resolve/main/phi-2-meditron.q4_k_m.gguf
It runs with llama.cpp. It's not chunked, mind you (in fact, I was testing whether models below 2GB all still work without chunking, and it crashed on the first one :-D ).
I believe the problem comes from the model itself or from llama.cpp. It shouldn't be a problem with wllama.
In any case, I'll update to the latest upstream source code today to see if it's fixed.
I continued testing, and it's happening with all the .gguf models now.
When I comment out the line model_settings['cache_type_k'] = 'q4_0'; everything works again.
E.g. these were crashing: https://huggingface.co/TheBloke/rocket-3B-GGUF/resolve/main/rocket-3b.Q5_K_M.gguf and https://huggingface.co/afrideva/Nous-Capybara-3B-V1.9-GGUF/resolve/main/nous-capybara-3b-v1.9.q5_k_m.gguf
I was wondering if it had to do with the .gguf models not being quantized to Q4 themselves.
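For context, here's roughly how that cache_type_k line fits into my loading code. This is a stripped-down sketch of my setup: the wasm path mapping and the n_ctx value are placeholders, and the only wllama call I'm relying on is loadModelFromUrl with the settings object as its second argument.

```js
import { Wllama } from '@wllama/wllama';

// Placeholder for the usual single-thread / multi-thread wasm path mapping
const wllama = new Wllama(WASM_CONFIG_PATHS);

// Settings object that gets passed to loadModelFromUrl
const model_settings = { n_ctx: 2048 };

// This is the line in question. With it commented out, the crashing models
// load again; the key cache then just stays at its default (f16) type.
// model_settings['cache_type_k'] = 'q4_0';

await wllama.loadModelFromUrl(
  'https://huggingface.co/TheBloke/rocket-3B-GGUF/resolve/main/rocket-3b.Q5_K_M.gguf',
  model_settings
);
```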
With cache_type_k set to q4_0, the minuscule Qwen does load:
https://huggingface.co/Qwen/Qwen1.5-0.5B-Chat-GGUF/resolve/main/qwen1_5-0_5b-chat-q4_0.gguf
Hmm, NeuralReyna, which is chunked, also loads. It's also natively Q4.
Did another test: I chunked a Q4 version of Phi 3 128K (2.18GB) into 9 chunks, and it loads and runs OK with the Q4 cache type enabled.
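In case it helps, this is roughly what that chunked load looks like on my side. Again only a sketch: the chunk URLs are made up, and I'm assuming loadModelFromUrl accepts a list of chunk URLs here (that's how I understand the chunked loading to work; correct me if the real API expects something else).

```js
// Made-up chunk URLs for the 9-part Phi 3 build (not the real file names).
const chunk_urls = Array.from({ length: 9 }, (_, i) =>
  `https://example.com/phi-3-mini-128k-q4/phi-3-chunk-${i + 1}-of-9.gguf`
);

// wllama is the instance from the earlier sketch.
// Assumption: loadModelFromUrl can take an array of chunk URLs.
// With the quantized key cache enabled, this natively-Q4 model runs fine.
await wllama.loadModelFromUrl(chunk_urls, {
  n_ctx: 2048,
  cache_type_k: 'q4_0',
});
```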
Yeah, I heard a while ago on llama.cpp that the Q4 cache type does not play very nicely with Phi models. Since the quantized cache is an experimental thing in llama.cpp, we can expect it to fail on some models.
Well, here it's the opposite :-D The Phi models do work.
Since the quantized cache is an experimental thing in llama.cpp, we can expect it to fail on some models.
Aha! I didn't know that. Thank you.
I'm going to test a bit more and see if the issue can be solved by chunking those Q5 models. If not, then I'll create a setting to simply not use cache_type_k on those deviants.
I'll also try to confirm if it has to do with models being quantized to Q3 / Q5 instead of Q4.
Tested the Q4 quant of Rocket 3B, but that also crashed. So it doesn't seem related to whether the models themselves are quantized to Q4.
I can only conclude that, as you say, for some models it just randomly doesn't work.
I'll create a variable in my code that enables or disables cache_type_k as needed.
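Something along these lines, probably. Just a sketch of the idea: the blocklist contents and the helper name are made up, and it reuses the same loadModelFromUrl call as before.

```js
// Models where the quantized key cache has crashed for me so far
// (illustrative list, not exhaustive).
const QUANTIZED_CACHE_BLOCKLIST = [
  'rocket-3b.Q5_K_M.gguf',
  'nous-capybara-3b-v1.9.q5_k_m.gguf',
];

// Made-up helper: build the settings for a model URL and only enable
// cache_type_k when the model isn't on the blocklist.
function build_model_settings(model_url) {
  const settings = { n_ctx: 2048 };
  const file_name = model_url.split('/').pop();
  if (!QUANTIZED_CACHE_BLOCKLIST.includes(file_name)) {
    settings['cache_type_k'] = 'q4_0';
  }
  return settings;
}

// wllama is the instance from the earlier sketch.
const model_url =
  'https://huggingface.co/TheBloke/rocket-3B-GGUF/resolve/main/rocket-3b.Q5_K_M.gguf';
await wllama.loadModelFromUrl(model_url, build_model_settings(model_url));
```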
I sometimes see this warning. Is it something to be worried about?
Is this perhaps related to all .gguf files needing to be remade after the llama.cpp project ran into a bug with Llama 3?
In this case the model crashed, but I suspect that has more to do with me not properly unloading the previous model (Mistral 7) before switching to this one (Phi 2).