huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Incomplete memory allocation on dual GPUs #32412

Open · Skit5 opened this issue 1 month ago

Skit5 commented 1 month ago

System Info

Who can help?

text models: @ArthurZucker

Information

Tasks

Reproduction

The issue occurs when using HF Transformers with Text-Generation WebUI (ticket opened here: https://github.com/oobabooga/text-generation-webui/issues/6028), but it was also reproduced with a simple loading test in a Jupyter Notebook using the base conda environment, so I'm opening a ticket here as it's probably an HF Transformers issue.

The problem happens when a large model (Llama-3-70B with 4-bit quantization) is loaded across dual GPUs (RTX 3090 + RTX 3090 Ti): a large part of the VRAM isn't allocated (GPU0: 94%, GPU1: 66%). This causes a torch.cuda.OutOfMemoryError for a modest length of text.
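The utilization figures above can be checked after loading with a small snippet like this (a sketch using only standard torch.cuda calls, nothing specific to this setup beyond the two CUDA devices):

```python
import torch

# Print how much of each GPU's VRAM is in use after the model has been loaded.
for device in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(device)
    used = total - free
    print(f"GPU{device}: {used / total:.0%} used "
          f"({used / 2**30:.1f} GiB of {total / 2**30:.1f} GiB)")
```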

17:34:28-952126 INFO     Loading "Meta-Llama-3-70B-Instruct"                    
17:34:28-968657 INFO     TRANSFORMERS_PARAMS=                                   
{   'low_cpu_mem_usage': True,
    'torch_dtype': torch.bfloat16,
    'use_flash_attention_2': True,
    'device_map': 'auto',
    'max_memory': {0: '24200MiB', 1: '24200MiB', 'cpu': '99GiB'},
    'quantization_config': BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": true,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
}
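
For reference, a minimal standalone version of the Jupyter loading test is sketched below; it maps the TRANSFORMERS_PARAMS dump above onto a plain from_pretrained call. The model id is a placeholder for my local copy, and attn_implementation="flash_attention_2" stands in for the older use_flash_attention_2=True flag:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder for the local model path

# Same quantization settings as in the BitsAndBytesConfig dump above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
    max_memory={0: "24200MiB", 1: "24200MiB", "cpu": "99GiB"},
    attn_implementation="flash_attention_2",
    quantization_config=bnb_config,
)

# Shows how accelerate distributed the layers across GPU 0, GPU 1 and the CPU.
print(model.hf_device_map)
```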

Expected behavior

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacity of 23.68 GiB of which 77.69 MiB is free. Including non-PyTorch memory, this process has 22.66 GiB memory in use. Of the allocated memory 22.28 GiB is allocated by PyTorch, and 70.71 MiB is reserved by PyTorch but unallocated.

There are two issues with this:

ArthurZucker commented 1 month ago

Hey! This seems more related to accelerate. Pinging @SunMarc and @muellerzr!

SunMarc commented 1 month ago

Hey @Skit5, thanks for reporting! From the traceback it seems that GPU 1 is the one that goes OOM. Could you try changing a few params in TRANSFORMERS_PARAMS to see if it behaves better? I would suggest changing device_map to "sequential" and tweaking the values in max_memory = {0: '24200MiB', 1: '24200MiB'} so that you don't hit the OOM. We will try to change how device_map='auto' behaves with bnb so that there isn't such an unbalanced allocation.
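
For example, something along these lines (a sketch; the max_memory caps are illustrative values to tune, and model_id / bnb_config stand for the same model path and BitsAndBytesConfig as in the original TRANSFORMERS_PARAMS):

```python
import torch
from transformers import AutoModelForCausalLM

# model_id and bnb_config as in the reproduction sketch above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="sequential",  # fill GPU 0 before spilling onto GPU 1, instead of "auto"
    max_memory={0: "23000MiB", 1: "22000MiB", "cpu": "99GiB"},  # illustrative caps; lower them until the OOM disappears
    quantization_config=bnb_config,
)
```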

github-actions[bot] commented 2 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.