watchfoxie opened 1 month ago
Running into the same issue
I think your VRAM is too small. Llama 3.1 has a 128k (131072-token) context length, and at Q8_0 that is barely enough even with 24 GB of VRAM. Try reducing the n_ctx value.
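For reference, Llama 3.1 8B has 32 layers with 8 KV heads of dimension 128, so the fp16 KV cache costs roughly 2 × 32 × 8 × 128 × 2 bytes ≈ 128 KiB per token, i.e. about 16 GiB at the full 131072-token context, on top of the ~8.5 GB of Q8_0 weights. A quick way to test a smaller context outside the Web UI (the model filename is a placeholder, adjust to your build):

```bash
# Load the quant with a reduced context (-c) and full GPU offload (-ngl);
# 99 is simply larger than the layer count, i.e. "offload everything".
./llama-cli -m Meta-Llama-3.1-8B-Instruct-Q8_0.gguf -c 8192 -ngl 99 -p "Hello"
```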
I have 16 GB of VRAM and ran into the same error with the context length turned all the way down to 512 and only one layer on the GPU, with no difference.
Can you show the entire log? I'm no expert on this, but others here are, and some of them might be able to help you.
I've already deleted the model and got a different one, but I believe it is related to this llama.cpp bug: https://github.com/ollama/ollama/issues/6048, which should be resolved by now.
Were you able to get it to work with a different model? I've tried a few different GGUF versions and the result is the same.
@nichjamesr Sorry, I forgot to add that I got a nice abliterated model through the Ollama page to use from the command line. It works nicely, but unfortunately it's no solution for the issue here. Come to think of it, I should have asked Llama about that.
Maybe the models made with the buggy llama.cpp version need to be patched themselves to be compatible again? Did you try looking for some very new ones, just for testing purposes? I'm not sure when exactly this was fixed in llama.cpp, but the newer the model, the more likely it is to work, if my guess holds any truth.
@PrometheusDante I'm not sure why, but for me lowering the context actually did the trick. I'm on a 3060 Ti (8 GB). The same model that wouldn't load at 128k loads fine if I set it to 64k or below.
I always set the standard context length of 8096, so this is not the cause. Regarding model settings and parameters, I always take care before loading.
So, I found the point of failure: the Python script convert_hf_to_gguf.py. One of these commit updates broke compatibility: #8627 or #8676. A temporary solution is to use an old llama.cpp backend to create the FP16 model, or to take an already-quantized one from HF (for example, the GAIANET one). After that, you can use a recent B3xxx release to obtain the desired quant without load issues in the Web UI.
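Roughly like this, assuming the standard llama.cpp tooling (the tag below is hypothetical, pick any release that predates the breaking commits; filenames and paths are placeholders):

```bash
# 1) Convert the HF weights to FP16 GGUF using a llama.cpp checkout from
#    before the regression (tag is a placeholder, not a verified cutoff).
git checkout b3xxx-pre-regression
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
    --outtype f16 --outfile Meta-Llama-3.1-8B-Instruct-f16.gguf

# 2) Quantize the FP16 file with a current release; quantization itself is fine.
git checkout master && make llama-quantize
./llama-quantize Meta-Llama-3.1-8B-Instruct-f16.gguf \
    Meta-Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0
```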
Hi, may I ask you to explain how to do this in detail?
Describe the bug
I downloaded the Hugging Face "meta-llama/Meta-Llama-3.1-8B-Instruct" model to do a Q8_0 quantization with the latest llama.cpp, to keep it up to date, increase efficiency, and avoid the shortcomings of the old quantization. However, the newly quantized model fails to load.
I also tried a ready-made quant from an experienced publisher (bartowski), in case I had made a mistake in the process, but I get the same error when loading the model into the Web UI. Older GGUF models of "meta-llama/Meta-Llama-3.1-8B-Instruct" work fine.
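A sketch of the presumed conversion steps, assuming the standard llama.cpp workflow (filenames and paths are placeholders):

```bash
# Convert the HF checkpoint to FP16 GGUF, then quantize to Q8_0,
# both with a current llama.cpp build.
python convert_hf_to_gguf.py ./Meta-Llama-3.1-8B-Instruct \
    --outtype f16 --outfile Meta-Llama-3.1-8B-Instruct-f16.gguf
./llama-quantize Meta-Llama-3.1-8B-Instruct-f16.gguf \
    Meta-Llama-3.1-8B-Instruct-Q8_0.gguf Q8_0
# Loading the resulting Q8_0 file in the Web UI then fails.
```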