huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Incomplete memory allocation on dual GPUs #32412

Open · Skit5 opened this issue 1 month ago

Skit5 commented 1 month ago

System Info

Who can help?

text models: @ArthurZucker

Information

Tasks

Reproduction

The issue occurs when using HF Transformers with Text-Generation WebUI (ticket opened here: https://github.com/oobabooga/text-generation-webui/issues/6028), but it was also reproduced with a simple loading test in a Jupyter Notebook using the base conda environment, so I'm opening a ticket here as it's probably an HF Transformers issue.

The problem happens when a large model (Llama-3-70B with 4-bit quantization) is loaded across dual GPUs (RTX 3090 + RTX 3090 Ti): a large part of the VRAM isn't allocated (GPU0: 94%, GPU1: 66%). This causes a torch.cuda.OutOfMemoryError for a modest length of text.
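The utilization figures above can be checked after loading with a small snippet like this (a sketch using only standard torch.cuda calls, nothing specific to this setup beyond the two CUDA devices):

```python
import torch

# Print how much of each GPU's VRAM is in use after the model has been loaded.
for device in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(device)
    used = total - free
    print(f"GPU{device}: {used / total:.0%} used "
          f"({used / 2**30:.1f} GiB of {total / 2**30:.1f} GiB)")
```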

17:34:28-952126 INFO     Loading "Meta-Llama-3-70B-Instruct"                    
17:34:28-968657 INFO     TRANSFORMERS_PARAMS=                                   
{   'low_cpu_mem_usage': True,
    'torch_dtype': torch.bfloat16,
    'use_flash_attention_2': True,
    'device_map': 'auto',
    'max_memory': {0: '24200MiB', 1: '24200MiB', 'cpu': '99GiB'},
    'quantization_config': BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "float16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": true,
  "llm_int8_enable_fp32_cpu_offload": true,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}
}
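
For reference, a minimal standalone version of the Jupyter loading test is sketched below; it maps the TRANSFORMERS_PARAMS dump above onto a plain from_pretrained call. The model id is a placeholder for my local copy, and attn_implementation="flash_attention_2" stands in for the older use_flash_attention_2=True flag:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder for the local model path

# Same quantization settings as in the BitsAndBytesConfig dump above.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="auto",
    max_memory={0: "24200MiB", 1: "24200MiB", "cpu": "99GiB"},
    attn_implementation="flash_attention_2",
    quantization_config=bnb_config,
)

# Shows how accelerate distributed the layers across GPU 0, GPU 1 and the CPU.
print(model.hf_device_map)
```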

Expected behavior

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacity of 23.68 GiB of which 77.69 MiB is free. Including non-PyTorch memory, this process has 22.66 GiB memory in use. Of the allocated memory 22.28 GiB is allocated by PyTorch, and 70.71 MiB is reserved by PyTorch but unallocated.

There are two issues with this:

ArthurZucker commented 1 month ago

Hey! This seems more related to accelerate. Pinging @SunMarc and @muellerzr!

SunMarc commented 1 month ago

Hey @Skit5, thanks for reporting! From the traceback it seems that GPU 1 is the one that goes OOM. Could you try changing a few params in TRANSFORMERS_PARAMS to see if it behaves better? I would suggest changing device_map to "sequential" and tweaking the values in max_memory = {0: '24200MiB', 1: '24200MiB'} so that you don't hit the OOM. We will try to change how device_map='auto' behaves with bnb so that there isn't such an unbalanced allocation.
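
For example, something along these lines (a sketch; the max_memory caps are illustrative values to tune, and model_id / bnb_config stand for the same model path and BitsAndBytesConfig as in the original TRANSFORMERS_PARAMS):

```python
import torch
from transformers import AutoModelForCausalLM

# model_id and bnb_config as in the reproduction sketch above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    device_map="sequential",  # fill GPU 0 before spilling onto GPU 1, instead of "auto"
    max_memory={0: "23000MiB", 1: "22000MiB", "cpu": "99GiB"},  # illustrative caps; lower them until the OOM disappears
    quantization_config=bnb_config,
)
```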

github-actions[bot] commented 2 hours ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.