oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Multi GPU for larger (Auto-)GPTQ models not working #3477

Closed anphex closed 1 year ago

anphex commented 1 year ago

Describe the bug

When using a small 3-8 bit model that would fit into a single GPU (3090) anyway, there are no issues, and Windows Task Manager also shows that both GPUs receive "something" in their VRAM in roughly equal amounts.

When using a bigger model that has to be split across two identical GPUs, TGI just falls back to loading it into CPU memory and then obviously fails. I tried about every setting: auto-devices, different splits, different flags - to no avail.

Is there an existing issue for this?

Reproduction

Use the latest TGI version via the .bat updater (as of August 6th, 2023).

  1. Acquire a GPTQ-quantized model that theoretically fits into 48 GB of VRAM split over two identical GPUs
  2. Try to load said model in TGI, with and without auto-devices (see the example command below)
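For illustration, the load attempt boils down to a command line along these lines (a sketch only; it assumes the server.py flags of the installer version in use, with the memory caps matching the max_memory values shown in the log below):

  # sketch; flag names per the text-generation-webui CLI of that period
  python server.py --model TheBloke_llama-2-70b-Guanaco-QLoRA-GPTQ_gptq-3bit-64g-actorder_True --loader autogptq --auto-devices --gpu-memory 24500MiB 24500MiB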

Screenshot

(Screenshots: loader settings with auto-devices enabled and with a manual memory split)

Logs

2023-08-06 14:42:25 INFO:Loading TheBloke_llama-2-70b-Guanaco-QLoRA-GPTQ_gptq-3bit-64g-actorder_True...
2023-08-06 14:42:25 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-3bit-64g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': {0: '24500MiB', 1: '24500MiB', 'cpu': '99GiB'}, 'quantize_config': None, 'use_cuda_fp16': True}
2023-08-06 14:42:50 ERROR:Failed to load the model.
Traceback (most recent call last):
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\text-generation-webui\server.py", line 68, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\text-generation-webui\modules\models.py", line 78, in load_model
    output = load_func_map[loader](model_name)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\text-generation-webui\modules\models.py", line 287, in AutoGPTQ_loader
    return modules.AutoGPTQ_loader.load_quantized(model_name)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\text-generation-webui\modules\AutoGPTQ_loader.py", line 56, in load_quantized
    model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\auto_gptq\modeling\auto.py", line 94, in from_quantized
    return quant_func(
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\auto_gptq\modeling\_base.py", line 736, in from_quantized
    model = AutoModelForCausalLM.from_config(
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 430, in from_config
    return model_class._from_config(config, **kwargs)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 1142, in _from_config
    model = cls(config, **kwargs)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 732, in __init__
    self.model = LlamaModel(config)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 561, in __init__
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 561, in <listcomp>
    self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 376, in __init__
    self.mlp = LlamaMLP(config)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 198, in __init__
    self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
  File "D:\_GENERATION_\TEXTGENERATIONWEBUI\installer_files\env\lib\site-packages\torch\nn\modules\linear.py", line 96, in __init__
    self.weight = Parameter(torch.empty((out_features, in_features), **factory_kwargs))
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 469762048 bytes.

System Info

Windows 11 latest
2x Nvidia RTX 3090
Intel 13700k
32 GB RAM
UnreasonableLevitation commented 1 year ago

I'm encountering the same problem on Ubuntu 20 with a 3080 Ti and a 1060 6GB.

Ph0rk0z commented 1 year ago

ok, I'm dumb and didn't use thousands for MB

UnreasonableLevitation commented 1 year ago

I found the issue. It works if you specify the final layer index that will be loaded onto each GPU instead of the number of layers to load onto each GPU (see llama_inference_offload.py in GPTQ-for-LLaMa). I wanted to load 35 layers onto cuda:0 and 5 onto cuda:1, so the correct argument is --pre-layer 35 40 and not --pre-layer 35 5.
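For reference, a minimal sketch of such an invocation (assuming a 40-layer model, a hypothetical model directory name, and that the loader name matches the GPTQ-for-LLaMa code path of this webui version):

  # your-13b-gptq-model is hypothetical; --pre-layer takes cumulative layer indices, not per-GPU counts
  python server.py --model your-13b-gptq-model --loader gptq-for-llama --pre-layer 35 40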

UnreasonableLevitation commented 1 year ago

Ah, my bad. I missed the Auto- part. My issue was that I was confused by this and assumed that "the numbers" referred to the number of layers per GPU (as with the memory amounts in --gpu-split) rather than the numbers that describe the cumulative layer distribution.

anphex commented 1 year ago

Through some help on the TheBloke Discord we discovered that setting a rather large page file helps when you have little RAM (32 GB, for example), since the page file can be used as a buffer while the model is loaded into VRAM.

When using ExLlama(-HF) I can now load all 70B 4-bit models cleanly, but the split, especially in AutoGPTQ, still doesn't seem to work. ("Work" here meaning my expectation that the loader figures out all the specs by itself and sets the optimal split.)

I can only emphasize that ExLlama does its job perfectly when using a 17,23 split on two 3090s. I didn't try GPTQ-for-LLaMa yet.
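For reference, a sketch of what that split looks like on the command line (assuming the --loader and --gpu-split flags of this webui version; the model name is the one from the log above):

  # --gpu-split values are per-GPU VRAM amounts in GB
  python server.py --model TheBloke_llama-2-70b-Guanaco-QLoRA-GPTQ_gptq-3bit-64g-actorder_True --loader exllama --gpu-split 17,23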

Ph0rk0z commented 1 year ago

GPTQ-for-LLaMa needs that --pre-layer setting, otherwise it will be slow. I don't think it supports the new attention in the 70B, though.

github-actions[bot] commented 1 year ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.