oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Loading Mistral-Nemo-Instruct-2407 (exl2) fails #6267

Open turandot2017 opened 1 month ago

turandot2017 commented 1 month ago

Describe the bug

Mistral-Nemo-Instruct-2407 (exl2) fails to load with "Insufficient VRAM for model and cache". The model is only 12B parameters and the machine has 120 GB of GPU memory in total (3× A100 40 GB), so there should be plenty of room.
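
A rough back-of-envelope estimate (my own assumption, not taken from the webui code) suggests the weights are not the problem: the KV cache is. If max_seq_len is left at the value auto-filled from the model's config.json, ExLlamaV2 tries to allocate a cache for the model's full advertised context, which for this model family is extremely long (on the order of a million tokens). The architecture figures below (40 layers, 8 KV heads, head_dim 128, ~1,024,000-token default context) are what I recall from the Mistral-Nemo config and should be treated as assumptions.

# Hypothetical sizing sketch, not from the repo: estimates the FP16 KV cache
# ExLlamaV2 would try to allocate at the model's default context length.
num_layers   = 40         # num_hidden_layers (assumed from config.json)
num_kv_heads = 8          # num_key_value_heads, GQA (assumed)
head_dim     = 128        # head_dim (assumed)
bytes_per_el = 2          # FP16 cache entries
seq_len      = 1_024_000  # max_position_embeddings reported in config.json (assumed)

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el  # K and V
cache_gib = bytes_per_token * seq_len / 1024**3
print(f"{bytes_per_token/1024:.0f} KiB per token, ~{cache_gib:.0f} GiB for the full context")
# -> roughly 160 KiB per token and ~156 GiB for the full-length cache, which
#    already exceeds the 3x40 GB of VRAM before the ~8 GB of 5bpw weights
#    are even counted.

If that estimate is in the right ballpark, the error message is expected behaviour at the default context length rather than a loader bug, and capping max_seq_len should make the model fit.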

Is there an existing issue for this?

Reproduction

Load DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw with the ExLlamav2 loader (autosplit enabled); the load fails with the traceback below.

Screenshot

No response

Logs

09:45:47-565763 INFO     Loading
                         "DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"    
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /app/server.py:242 in <module>                                               │
│                                                                              │
│   241         # Load the model                                               │
│ ❱ 242         shared.model, shared.tokenizer = load_model(model_name)        │
│   243         if shared.args.lora:                                           │
│                                                                              │
│ /app/modules/models.py:87 in load_model                                      │
│                                                                              │
│    86     shared.args.loader = loader                                        │
│ ❱  87     output = load_func_map[loader](model_name)                         │
│    88     if type(output) is tuple:                                          │
│                                                                              │
│ /app/modules/models.py:373 in ExLlamav2_loader                               │
│                                                                              │
│   372                                                                        │
│ ❱ 373     model, tokenizer = Exllamav2Model.from_pretrained(model_name)      │
│   374     return model, tokenizer                                            │
│                                                                              │
│ /app/modules/exllamav2.py:70 in from_pretrained                              │
│                                                                              │
│    69         if shared.args.autosplit:                                      │
│ ❱  70             model.load_autosplit(cache)                                │
│    71                                                                        │
│                                                                              │
│ /venv/lib/python3.10/site-packages/exllamav2/model.py:349 in load_autosplit  │
│                                                                              │
│   348         f = self.load_autosplit_gen(cache, reserve_vram, last_id_only, │
│ ❱ 349         for item in f: x = item                                        │
│   350                                                                        │
│                                                                              │
│ /venv/lib/python3.10/site-packages/exllamav2/model.py:476 in                 │
│ load_autosplit_gen                                                           │
│                                                                              │
│   475                         if current_device >= num_devices:              │
│ ❱ 476                             raise RuntimeError("Insufficient VRAM for  │
│   477                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Insufficient VRAM for model and cache
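
For reference, a minimal standalone sketch of the same load path outside the webui. It assumes the exllamav2 Python API as of mid-2024 (ExLlamaV2Config / ExLlamaV2 / ExLlamaV2Cache with lazy=True feeding load_autosplit) and uses a placeholder model path and context cap; capping max_seq_len, or giving an explicit per-GPU split instead of autosplit, is what I would try first.

# Hypothetical repro/workaround sketch using the exllamav2 API directly
# (same load_autosplit path as the traceback above). The model path and the
# 16384 context cap are placeholders, not values from this report.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"
config.prepare()
config.max_seq_len = 16384          # override the very large default from config.json

model = ExLlamaV2(config)

# Option A: autosplit across all visible GPUs (the path that fails in the log above)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Option B (instead of the two lines above): an explicit per-GPU split in GB,
# e.g. model.load([30, 30, 30]), then create the cache non-lazily afterwards.

tokenizer = ExLlamaV2Tokenizer(config)

In the webui itself the equivalent knobs should be the max_seq_len and gpu-split settings of the ExLlamav2 loader (or the --max_seq_len / --gpu-split command-line flags), assuming those options are unchanged in the current release.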

System Info

[0] NVIDIA A100-SXM4-40GB | 37°C,   0 % |     0 / 40960 MB |
[1] NVIDIA A100-SXM4-40GB | 41°C,   0 % |     0 / 40960 MB |
[2] NVIDIA A100-SXM4-40GB | 38°C,   0 % |   533 / 40960 MB | root(528M)
carllinnaeus43 commented 1 month ago

I'm encountering the same issue.