oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.

Loading Mistral-Nemo-Instruct-2407 (exl2) fails #6267

Open turandot2017 opened 1 month ago

turandot2017 commented 1 month ago

Describe the bug

Mistral-Nemo-Instruct-2407 (exl2) fails to load with "Insufficient VRAM for model and cache". The model is only 12B parameters and the machine has 120 GB of GPU memory in total (3× A100 40 GB), so there should be plenty of room.
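
A rough back-of-envelope estimate (my own assumption, not taken from the webui code) suggests the weights are not the problem: the KV cache is. If max_seq_len is left at the value auto-filled from the model's config.json, ExLlamaV2 tries to allocate a cache for the model's full advertised context, which for this model family is extremely long (on the order of a million tokens). The architecture figures below (40 layers, 8 KV heads, head_dim 128, ~1,024,000-token default context) are what I recall from the Mistral-Nemo config and should be treated as assumptions.

# Hypothetical sizing sketch, not from the repo: estimates the FP16 KV cache
# ExLlamaV2 would try to allocate at the model's default context length.
num_layers   = 40         # num_hidden_layers (assumed from config.json)
num_kv_heads = 8          # num_key_value_heads, GQA (assumed)
head_dim     = 128        # head_dim (assumed)
bytes_per_el = 2          # FP16 cache entries
seq_len      = 1_024_000  # max_position_embeddings reported in config.json (assumed)

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el  # K and V
cache_gib = bytes_per_token * seq_len / 1024**3
print(f"{bytes_per_token/1024:.0f} KiB per token, ~{cache_gib:.0f} GiB for the full context")
# -> roughly 160 KiB per token and ~156 GiB for the full-length cache, which
#    already exceeds the 3x40 GB of VRAM before the ~8 GB of 5bpw weights
#    are even counted.

If that estimate is in the right ballpark, the error message is expected behaviour at the default context length rather than a loader bug, and capping max_seq_len should make the model fit.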

Is there an existing issue for this?

Reproduction

Load DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw with the ExLlamav2 loader (autosplit enabled); the load fails with the traceback below.

Screenshot

No response

Logs

09:45:47-565763 INFO     Loading
                         "DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"    
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /app/server.py:242 in <module>                                               │
│                                                                              │
│   241         # Load the model                                               │
│ ❱ 242         shared.model, shared.tokenizer = load_model(model_name)        │
│   243         if shared.args.lora:                                           │
│                                                                              │
│ /app/modules/models.py:87 in load_model                                      │
│                                                                              │
│    86     shared.args.loader = loader                                        │
│ ❱  87     output = load_func_map[loader](model_name)                         │
│    88     if type(output) is tuple:                                          │
│                                                                              │
│ /app/modules/models.py:373 in ExLlamav2_loader                               │
│                                                                              │
│   372                                                                        │
│ ❱ 373     model, tokenizer = Exllamav2Model.from_pretrained(model_name)      │
│   374     return model, tokenizer                                            │
│                                                                              │
│ /app/modules/exllamav2.py:70 in from_pretrained                              │
│                                                                              │
│    69         if shared.args.autosplit:                                      │
│ ❱  70             model.load_autosplit(cache)                                │
│    71                                                                        │
│                                                                              │
│ /venv/lib/python3.10/site-packages/exllamav2/model.py:349 in load_autosplit  │
│                                                                              │
│   348         f = self.load_autosplit_gen(cache, reserve_vram, last_id_only, │
│ ❱ 349         for item in f: x = item                                        │
│   350                                                                        │
│                                                                              │
│ /venv/lib/python3.10/site-packages/exllamav2/model.py:476 in                 │
│ load_autosplit_gen                                                           │
│                                                                              │
│   475                         if current_device >= num_devices:              │
│ ❱ 476                             raise RuntimeError("Insufficient VRAM for  │
│   477                                                                        │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Insufficient VRAM for model and cache
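
For reference, a minimal standalone sketch of the same load path outside the webui. It assumes the exllamav2 Python API as of mid-2024 (ExLlamaV2Config / ExLlamaV2 / ExLlamaV2Cache with lazy=True feeding load_autosplit) and uses a placeholder model path and context cap; capping max_seq_len, or giving an explicit per-GPU split instead of autosplit, is what I would try first.

# Hypothetical repro/workaround sketch using the exllamav2 API directly
# (same load_autosplit path as the traceback above). The model path and the
# 16384 context cap are placeholders, not values from this report.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/models/DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"
config.prepare()
config.max_seq_len = 16384          # override the very large default from config.json

model = ExLlamaV2(config)

# Option A: autosplit across all visible GPUs (the path that fails in the log above)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)

# Option B (instead of the two lines above): an explicit per-GPU split in GB,
# e.g. model.load([30, 30, 30]), then create the cache non-lazily afterwards.

tokenizer = ExLlamaV2Tokenizer(config)

In the webui itself the equivalent knobs should be the max_seq_len and gpu-split settings of the ExLlamav2 loader (or the --max_seq_len / --gpu-split command-line flags), assuming those options are unchanged in the current release.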

System Info

[0] NVIDIA A100-SXM4-40GB | 37°C,   0 % |     0 / 40960 MB |
[1] NVIDIA A100-SXM4-40GB | 41°C,   0 % |     0 / 40960 MB |
[2] NVIDIA A100-SXM4-40GB | 38°C,   0 % |   533 / 40960 MB | root(528M)
carllinnaeus43 commented 1 month ago

I'm encountering the same issue.