dlippold opened this issue 2 months ago
"ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30649.55 MiB on device 0: cudaMalloc failed: out of memory"
is the relevant line; the rest of the output appears because the Python wrappers the call was nested in also need to close, so they print to the console afterward. I think.
There aren't "custom" errors as far as I know; the calls go to llama.cpp under the hood and the errors are its responses. And while I get what you're hoping to see, and what I wrote above can be worked around, there are many more use cases already being built for which this would add unneeded complexity and would likely change or become irrelevant.
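For illustration, a minimal sketch of what that looks like from the Python side, assuming the llama-cpp-python bindings (the model path is hypothetical): the load either succeeds or raises a generic error, while llama.cpp's own log lines, like the cudaMalloc failure quoted above, go straight to the console.

```python
from llama_cpp import Llama  # llama-cpp-python bindings

try:
    # Ask for full GPU offload; llama.cpp prints its own diagnostics
    # (including any cudaMalloc OOM line) to the console during this call.
    llm = Llama(model_path="model.gguf", n_gpu_layers=-1)  # hypothetical path
except ValueError as e:
    # llama-cpp-python raises a plain ValueError when the load fails;
    # there is no structured "out of VRAM" error type to inspect here.
    print(f"Model failed to load: {e}")
```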
An example is model loading. Imagine I load 1.2 whole models on GPU A, 0.6 of the second model on GPU B, and the leftover into system RAM, which would be dope (a sketch of that kind of split is below). That logic would make a statement like "There was not enough VRAM to load the model for GPU. Therefore it is loaded into RAM for processing by CPU." require much more clarity to make sense again. So I think it's best it stays the way it is, since the feature set, the calculations, and more are continually evolving. My 2 cents, really.
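For concreteness, a hedged sketch of that kind of split, assuming llama-cpp-python's `n_gpu_layers` and `tensor_split` parameters; the path, layer count, and ratios are made up:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="second-model.gguf",  # hypothetical path
    n_gpu_layers=20,                 # only part of the layers are offloaded to VRAM...
    tensor_split=[0.2, 0.8],         # ...divided across GPU A and GPU B in this ratio
)
# Layers beyond n_gpu_layers stay in system RAM and run on the CPU.
```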
Describe the bug
When there is not enough VRAM to load a GGUF model, an out of memory (OOM) error is thrown. In addition, the last lines of the error message are misleading:
Instead of throwing an OOM error, it should try to load the model into RAM (for processing by the CPU) and output a clear message, something like "There was not enough VRAM to load the model for GPU. Therefore it is loaded into RAM for processing by CPU."
If there is also not enough RAM to load the model, another clear message should be output, something like:
This would also avoid the problem reported in #5341.
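For what it's worth, here is a minimal sketch of the requested fallback, not the project's actual loader, assuming llama-cpp-python and that the failed VRAM allocation surfaces as the generic load error; names and messages are illustrative:

```python
from llama_cpp import Llama

def load_with_fallback(model_path: str) -> Llama:
    try:
        # First attempt: offload all layers to the GPU.
        return Llama(model_path=model_path, n_gpu_layers=-1)
    except ValueError:
        print("There was not enough VRAM to load the model for GPU. "
              "Therefore it is loaded into RAM for processing by CPU.")
    try:
        # Fallback: keep every layer in system RAM (CPU-only).
        return Llama(model_path=model_path, n_gpu_layers=0)
    except ValueError as e:
        raise MemoryError("There was also not enough RAM to load the model.") from e
```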
Is there an existing issue for this?
Reproduction
Load a GGUF model that uses almost all of the VRAM. Then start a second process (e.g. as a second user) that tries to load another large GGUF model; a sketch of this is below.
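A sketch of that repro, assuming llama-cpp-python and a hypothetical model path; run the script from two shells at roughly the same time and the second process should hit the cudaMalloc out-of-memory error:

```python
from llama_cpp import Llama

# Fully offload a model that nearly fills the VRAM, then hold it.
llm = Llama(model_path="large-model.gguf", n_gpu_layers=-1)
input("Model loaded; press Enter to release the VRAM...")
```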
Screenshot
No response
Logs
System Info