oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Handle insufficient memory reasonably and clearly #5987

Open dlippold opened 2 months ago

dlippold commented 2 months ago

Describe the bug

When there is not enough VRAM to load a GGUF model, an out-of-memory (OOM) error is thrown. In addition, the last lines of the error message are misleading:

Traceback (most recent call last):
  File "/home/textgen/text-generation-webui-240504/modules/llamacpp_model.py", line 58, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
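That misleading trailing AttributeError could be avoided by guarding the destructor in modules/llamacpp_model.py, roughly like this (just a sketch):

```python
def __del__(self):
    # Only delete the model if loading got far enough to set the attribute,
    # so a failed load is not followed by an unrelated AttributeError.
    if hasattr(self, 'model'):
        del self.model
```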

Instead of throwing an OOM error, the webui should try to load the model into RAM (for processing by the CPU) and output a clear message, something like:

There was not enough VRAM to load the model for GPU. Therefore it is loaded into RAM for processing by CPU.

If there is also not enough RAM to load the model, another clear message should be output, something like:

There was neither enough VRAM nor enough RAM to load the model.

This would also avoid the problem reported in #5341.
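For illustration, here is a rough sketch of the fallback behaviour I mean, using llama-cpp-python directly (the function name load_with_fallback is made up; the real loading code lives in modules/llamacpp_model.py):

```python
from llama_cpp import Llama

def load_with_fallback(model_path: str, n_gpu_layers: int):
    """Try to load the GGUF model on the GPU; fall back to CPU-only RAM."""
    try:
        return Llama(model_path=model_path, n_gpu_layers=n_gpu_layers)
    except ValueError:
        # llama-cpp-python raises ValueError("Failed to load model from file: ...")
        # when the backend buffer cannot be allocated (see the log below).
        print("There was not enough VRAM to load the model for GPU. "
              "Therefore it is loaded into RAM for processing by CPU.")
        try:
            return Llama(model_path=model_path, n_gpu_layers=0)
        except ValueError:
            raise RuntimeError("There was neither enough VRAM nor enough RAM "
                               "to load the model.")
```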

Is there an existing issue for this?

Reproduction

Load a GGUF model that uses almost all of the VRAM. Then start a second process (e.g., by a second user) that tries to load another large GGUF model.

Screenshot

No response

Logs

ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30649.55 MiB on device 0: cudaMalloc failed: out of memory
llama_model_load: error loading model: unable to allocate backend buffer
llama_load_model_from_file: failed to load model
╭───────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────╮
│ /home/textgen/text-generation-webui-240504/server.py:242 in <module>                                                 │
│                                                                                                                      │
│   241         # Load the model                                                                                       │
│ ❱ 242         shared.model, shared.tokenizer = load_model(model_name)                                                │
│   243         if shared.args.lora:                                                                                   │
│                                                                                                                      │
│ /home/textgen/text-generation-webui-240504/modules/models.py:94 in load_model                                        │
│                                                                                                                      │
│    93     shared.args.loader = loader                                                                                │
│ ❱  94     output = load_func_map[loader](model_name)                                                                 │
│    95     if type(output) is tuple:                                                                                  │
│                                                                                                                      │
│ /home/textgen/text-generation-webui-240504/modules/models.py:272 in llamacpp_loader                                  │
│                                                                                                                      │
│   271     logger.info(f"llama.cpp weights detected: \"{model_file}\"")                                               │
│ ❱ 272     model, tokenizer = LlamaCppModel.from_pretrained(model_file)                                               │
│   273     return model, tokenizer                                                                                    │
│                                                                                                                      │
│ /home/textgen/text-generation-webui-240504/modules/llamacpp_model.py:102 in from_pretrained                          │
│                                                                                                                      │
│   101                                                                                                                │
│ ❱ 102         result.model = Llama(**params)                                                                         │
│   103         if cache_capacity > 0:                                                                                 │
│                                                                                                                      │
│ /home/textgen/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda_tensorcores │
│ /llama.py:323 in __init__                                                                                            │
│                                                                                                                      │
│    322                                                                                                               │
│ ❱  323         self._model = _LlamaModel(                                                                            │
│    324             path_model=self.model_path, params=self.model_params, verbose=self.verbose                        │
│                                                                                                                      │
│ /home/textgen/text-generation-webui-main/installer_files/env/lib/python3.11/site-packages/llama_cpp_cuda_tensorcores │
│ /_internals.py:55 in __init__                                                                                        │
│                                                                                                                      │
│    54         if self.model is None:                                                                                 │
│ ❱  55             raise ValueError(f"Failed to load model from file: {path_model}")                                  │
│    56                                                                                                                │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Failed to load model from file: models/Mixtral_Instruct_8x7B.gguf
Exception ignored in: <function LlamaCppModel.__del__ at 0x7ff7db2022a0>
Traceback (most recent call last):
  File "/home/textgen/text-generation-webui-240504/modules/llamacpp_model.py", line 58, in __del__
    del self.model
        ^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'

System Info

NVIDIA RTX 8000
Intel Xeon CPU
Ubuntu Linux 22.04.
erasmus74 commented 1 month ago

"ggml_backend_cuda_buffer_type_alloc_buffer: allocating 30649.55 MiB on device 0: cudaMalloc failed: out of memory"

is the relevant line; the rest of the output appears because the Python wrappers the call was nested in also need to close, so they print to the console afterwards. I think.

There aren't "custom" errors as far as I know; the calls just go to llama.cpp under the hood and the errors are its responses. And while I get what you're hoping to see, and what I wrote above can be worked around, there are many more use cases already being built for which this would add unneeded complexity and could change or become irrelevant.

An example is model loading. Imagine I load 1.2 whole models on GPU A, 0.6 of a second model on GPU B, and the leftover into system RAM, which would be dope. That logic would make a statement like "There was not enough VRAM to load the model for GPU. Therefore it is loaded into RAM for processing by CPU." require much more clarity to make sense again. So I think it's best it stays the way it is, as the feature set, the calculations, and more are continually evolving. 2 cents really.
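To illustrate, llama-cpp-python can already spread a model over several GPUs and system RAM, something like the following (a rough sketch; the layer count and split ratios are made up):

```python
from llama_cpp import Llama

# Offload 30 layers to the GPUs, split roughly 60/40 between device 0 and 1;
# whatever does not fit on the GPUs stays in system RAM for the CPU.
model = Llama(
    model_path="models/Mixtral_Instruct_8x7B.gguf",
    n_gpu_layers=30,
    tensor_split=[0.6, 0.4],
)
```

A single "fell back to CPU" message would not describe that kind of split accurately.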