dlippold opened this issue 2 months ago
"ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30649.55 MiB on device 0: cudaMalloc failed: out of memory"
is the relevant line; the rest of the output appears because the Python wrappers the call was nested in also need to close, so they print to the console afterward. I think.
There aren't "custom" errors as far as I know; the calls go to llama.cpp under the hood and the errors are its responses. And while I get what you're hoping to see, and what I wrote above can be worked around, there are many more use cases already being built for which this would add unneeded complexity and would likely change or become irrelevant.
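For illustration, a minimal sketch of what that looks like from the Python side, assuming the llama-cpp-python bindings (the model path is hypothetical): the load either succeeds or raises a generic error, while llama.cpp's own log lines, like the cudaMalloc failure quoted above, go straight to the console.

```python
from llama_cpp import Llama  # llama-cpp-python bindings

try:
    # Ask for full GPU offload; llama.cpp prints its own diagnostics
    # (including any cudaMalloc OOM line) to the console during this call.
    llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # hypothetical path
except ValueError as e:
    # llama-cpp-python raises a plain ValueError when the load fails;
    # there is no structured "out of VRAM" error type to inspect here.
    print(f"Model failed to load: {e}")
```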
An example is model loading. Imagine I load 1.2 whole models on GPU A, 0.6 of the second model on GPU B, and the leftover into system RAM, which would be dope (a sketch of that kind of split is below). That logic would make a statement like "There was not enough VRAM to load the model for GPU. Therefore it is loaded into RAM for processing by CPU." require much more clarity to make sense again. So I think it's best it stays the way it is, since the feature set, the calculations, and more are continually evolving. My 2 cents, really.
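For concreteness, a hedged sketch of that kind of split, assuming llama-cpp-python's `n_gpu_layers` and `tensor_split` parameters; the path, layer count, and ratios are made up:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="second-model.gguf",  # hypothetical path
    n_gpu_layers=20,                 # only part of the layers are offloaded to VRAM...
    tensor_split=[0.2, 0.8],         # ...divided across GPU A and GPU B in this ratio
)
# Layers beyond n_gpu_layers stay in system RAM and run on the CPU.
```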
Describe the bug
When there is not enough VRAM to load a GGUF model, an out of memory (OOM) error is thrown. In addition, the last lines of the error message are misleading:
Instead of throwing an OOM error, it should try to load the model into RAM (for processing by the CPU) and output a clear message, something like "There was not enough VRAM to load the model for GPU. Therefore it is loaded into RAM for processing by CPU."
If there is also not enough RAM to load the model, another clear message should be output, something like:
This would also avoid the problem reported in #5341.
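For what it's worth, here is a minimal sketch of the requested fallback, not the project's actual loader, assuming llama-cpp-python and that the failed VRAM allocation surfaces as the generic load error; names and messages are illustrative:

```python
from llama_cpp import Llama

def load_with_fallback(model_path: str) -> Llama:
    try:
        # First attempt: offload all layers to the GPU.
        return Llama(model_path=model_path, n_gpu_layers=-1)
    except ValueError:
        print("There was not enough VRAM to load the model for GPU. "
              "Therefore it is loaded into RAM for processing by CPU.")
    try:
        # Fallback: keep every layer in system RAM (CPU-only).
        return Llama(model_path=model_path, n_gpu_layers=0)
    except ValueError as e:
        raise MemoryError("There was also not enough RAM to load the model.") from e
```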
Is there an existing issue for this?
Reproduction
Load a GGUF model that uses almost all of the VRAM. Then start a second process (e.g. as a second user) that tries to load another large GGUF model; a sketch of this is below.
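A sketch of that repro, assuming llama-cpp-python and a hypothetical model path; run the script from two shells at roughly the same time and the second process should hit the cudaMalloc out-of-memory error:

```python
from llama_cpp import Llama

# Fully offload a model that nearly fills the VRAM, then hold it.
llm = Llama(model_path="large-model.gguf", n_gpu_layers=-1)
input("Model loaded; press Enter to release the VRAM...")
```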
Screenshot
No response
Logs
System Info