Can't load Mistral-Nemo-Instruct-2407: Insufficient VRAM for model and cache
Open    turandot2017 opened 1 month ago

Describe the bug
Loading Mistral-Nemo-Instruct-2407 fails with "Insufficient VRAM for model and cache". The model is only 12B parameters and I have 120 GB of GPU memory in total (3 x A100 40 GB), so it should fit comfortably.

Reproduction
Load "DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw" with the ExLlamav2 loader (autosplit enabled); the load fails with the error in the logs below.

Screenshot
No response
Logs
09:45:47-565763 INFO  Loading "DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"

Traceback (most recent call last):

/app/server.py:242 in <module>
    241  # Load the model
  ❱ 242  shared.model, shared.tokenizer = load_model(model_name)
    243  if shared.args.lora:

/app/modules/models.py:87 in load_model
     86  shared.args.loader = loader
  ❱  87  output = load_func_map[loader](model_name)
     88  if type(output) is tuple:

/app/modules/models.py:373 in ExLlamav2_loader
    372
  ❱ 373  model, tokenizer = Exllamav2Model.from_pretrained(model_name)
    374  return model, tokenizer

/app/modules/exllamav2.py:70 in from_pretrained
     69  if shared.args.autosplit:
  ❱  70  model.load_autosplit(cache)
     71

/venv/lib/python3.10/site-packages/exllamav2/model.py:349 in load_autosplit
    348  f = self.load_autosplit_gen(cache, reserve_vram, last_id_only,
  ❱ 349  for item in f: x = item
    350

/venv/lib/python3.10/site-packages/exllamav2/model.py:476 in load_autosplit_gen
    475  if current_device >= num_devices:
  ❱ 476  raise RuntimeError("Insufficient VRAM for
    477

RuntimeError: Insufficient VRAM for model and cache
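For reference, the failing path above reduces to the standard ExLlamaV2 autosplit load. A minimal standalone sketch of that path (the model directory is a placeholder, and the exact exllamav2 Python API may differ slightly between versions):

# Minimal sketch of the autosplit load that ends in "Insufficient VRAM for model and cache".
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
tokenizer = ExLlamaV2Tokenizer(config)

# lazy=True defers cache allocation so load_autosplit() can spread the layers
# and the cache across the visible GPUs as it goes.
cache = ExLlamaV2Cache(model, lazy=True)

# This is the call from modules/exllamav2.py line 70; it raises RuntimeError
# when it runs out of devices before the model and cache both fit.
model.load_autosplit(cache)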
System Info
[0] NVIDIA A100-SXM4-40GB | 37°C, 0 % |   0 / 40960 MB |
[1] NVIDIA A100-SXM4-40GB | 41°C, 0 % |   0 / 40960 MB |
[2] NVIDIA A100-SXM4-40GB | 38°C, 0 % | 533 / 40960 MB | root(528M)
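Not sure yet why autosplit fails here, but one thing worth trying is a manual per-GPU split instead of autosplit. A rough sketch, assuming exllamav2's ExLlamaV2.load() accepts a list of per-device allocations in GB (the numbers below are placeholders, not tuned values):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache

config = ExLlamaV2Config()
config.model_dir = "/path/to/DrNicefellow_Mistral-Nemo-Instruct-2407-exl2-5bpw"  # placeholder path
config.prepare()

model = ExLlamaV2(config)

# Reserve an explicit amount of VRAM on each of the three A100s (GB per device).
model.load([20, 20, 20])

# With a manual split, the cache is allocated after the weights are placed.
cache = ExLlamaV2Cache(model)

In text-generation-webui this should correspond to setting a gpu-split value and leaving autosplit off, if I am reading modules/exllamav2.py correctly.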
I'm encountering the same issue.