huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

Cannot free VRAM after loading a quantized model #2871

Closed lstein closed 3 months ago

lstein commented 4 months ago

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-5.15.0-79-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/lstein/test_ckpts/SD3/.venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.0.0
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 62.57 GB
- GPU type: NVIDIA GeForce RTX 4070
- `Accelerate` default config:
        Not found

Reproduction

I am part of the InvokeAI development team (www.invoke.ai) and am working on support for the Stable Diffusion 3 text-to-image model. This requires sequentially loading and unloading portions of the generation pipeline into VRAM on CUDA systems.

After quantizing the Hugging Face T5EncoderModel with load_in_8bit, I cannot remove the model from VRAM. This appears to be related to the issue reported at https://github.com/huggingface/transformers/issues/21094, but none of the proposed solutions work for me. The following script illustrates the issue:

import gc

import torch
from transformers import T5EncoderModel, BitsAndBytesConfig
from accelerate.utils import release_memory

FULL_MODEL = 'stabilityai/stable-diffusion-3-medium-diffusers'

print("\n* With quantized model *")

# Load the T5 text encoder quantized to 8-bit with bitsandbytes.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = T5EncoderModel.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
    subfolder='text_encoder_3',
    quantization_config=quantization_config,
    low_cpu_mem_usage=True,
    device_map='auto',
)
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

referrers = gc.get_referrers(model)
print('Referrers = ', len(referrers))

# Attempt to free the model with accelerate's helper, then drop our reference.
release_memory(model)
model = None
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())

gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
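As an aside, accelerate's release_memory() already runs gc.collect() and torch.cuda.empty_cache() and returns its arguments set to None (its docstring example is a, b = release_memory(a, b)), so the reference can also be dropped by reassigning the result. A minimal sketch of that documented pattern:

# Sketch: let release_memory() null the reference, collect garbage and empty
# the CUDA cache in a single call; its return value is the arguments set to None.
model, = release_memory(model)
print('After release_memory, VRAM usage=', torch.cuda.memory_allocated())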

Expected behavior

The actual output is:

* With quantized model *
After loading, VRAM usage= 7918596096
Referrers =  7
After model deletion, VRAM usage= 7918596096
After gc_collect and empty_cache, VRAM usage= 7918596096

The expected behavior is for the last line to read VRAM usage= 0. Indeed, when I comment out the quantization_config parameter, the VRAM is released.
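For contrast, here is a minimal sketch of the unquantized path (quantization_config and device_map dropped, encoder moved to CUDA explicitly); in this case the memory is released as soon as the reference is dropped:

import gc

import torch
from transformers import T5EncoderModel

FULL_MODEL = 'stabilityai/stable-diffusion-3-medium-diffusers'

# Unquantized fp16 copy of the same text encoder, moved to the GPU directly.
model = T5EncoderModel.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
    subfolder='text_encoder_3',
).to('cuda')
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

del model
gc.collect()
torch.cuda.empty_cache()
# Without quantization this reports 0.
print('After deletion, VRAM usage=', torch.cuda.memory_allocated())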

BenjaminBossan commented 4 months ago

Thanks for reporting, I can replicate the issue as you described. Some further tests that I did:

  • with 2 GPUs, the memory is not freed, even w/o quantization
  • 4bit or 8bit makes no difference
  • When simply loading AutoModelForCausalLM.from_pretrained("facebook/opt-125m"), memory is also not freed, whether with or without bnb

The last point made me wonder if the measurement is somehow incorrect. Adding some sleep time made no difference, though.
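One further check (my own assumption, not something tried here beyond the sleep) would be to synchronize the device before reading the counters, to rule out the reading trailing behind queued CUDA work:

import torch

# Wait for all queued CUDA work to finish before sampling the allocator stats,
# so the counters cannot reflect frees that have not happened yet.
torch.cuda.synchronize()
print('allocated:', torch.cuda.memory_allocated())
print('reserved: ', torch.cuda.memory_reserved())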

github-actions[bot] commented 3 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

markyfsun commented 3 months ago

I ran into the same problem but managed to get it to work this way: del model.__dict__ before del model

Reproduction:

import gc

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

FULL_MODEL = "facebook/opt-125m"

print("* With unquantized model *")
model = AutoModelForCausalLM.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
).to('cuda')
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

referrers = gc.get_referrers(model)
print('Referrers = ', len(referrers))

del model
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())

gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())

print("\n* With quantized model *")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

referrers = gc.get_referrers(model)
print('Referrers = ', len(referrers))

del model.__dict__  # <--- add this before `del model` to break the reference cycle
del model
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())

gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())

Output:

* With unquantized model *
After loading, VRAM usage= 257405952
Referrers =  1
After model deletion, VRAM usage= 0
After gc_collect and empty_cache, VRAM usage= 0
* With quantized model *
`low_cpu_mem_usage` was None, now set to True since model is quantized.
After loading, VRAM usage= 166252544
Referrers =  7
After model deletion, VRAM usage= 166252544
After gc_collect and empty_cache, VRAM usage= 0

Explanation: the root cause is a circular reference.

(The original comment included two backreference graphs here: one for the model loaded without quantization and one for the model loaded with load_in_8bit.)
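Graphs like these can be generated with the objgraph package (an assumption on my part about how the original images were produced); a minimal sketch:

# pip install objgraph graphviz
import objgraph

# Render the chain of objects that still refer to the model. For the 8-bit
# model this chain contains a cycle, so plain reference counting never drops
# the weights when `del model` runs.
objgraph.show_backrefs([model], max_depth=3, filename='backrefs.png')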

Deleting model.__dict__ cuts the reference cycle, so gc.collect() can release the memory.
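For sequential load/unload (the InvokeAI use case from the original report), the workaround could be wrapped in a small helper. This is only a sketch of the del model.__dict__ trick described above, not an accelerate or transformers API, and free_quantized_model is a hypothetical name:

import gc

import torch


def free_quantized_model(model):
    """Sketch of the `del model.__dict__` workaround from this thread.

    Dropping the attribute dict breaks the reference cycle created by 8-bit
    loading; the model object is unusable afterwards, and the caller must also
    drop its own reference (e.g. `model = None`).
    """
    del model.__dict__        # cut the reference cycle that pins the weights
    del model                 # drop this function's local reference
    gc.collect()              # reclaim the now-unreachable cycle
    torch.cuda.empty_cache()  # hand cached blocks back to the CUDA driver


# Hypothetical usage:
# free_quantized_model(text_encoder)
# text_encoder = None
# print(torch.cuda.memory_allocated())  # expected to drop to 0 once freed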

markyfsun commented 3 months ago

> Thanks for reporting, I can replicate the issue as you described. Some further tests that I did:
>
>   • with 2 GPUs, the memory is not freed, even w/o quantization
>   • 4bit or 8bit makes no difference
>   • When simply loading AutoModelForCausalLM.from_pretrained("facebook/opt-125m"), memory is also not freed, whether with or without bnb
>
> The last point made me wonder if the measurement is somehow incorrect. Adding some sleep time made no difference, though.

This is probably due to the circular reference created by referrers = gc.get_referrers(model). Reproduction (with the referrers lines removed):

import gc

import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM

FULL_MODEL = "facebook/opt-125m"

print("* With unquantized model *")
model = AutoModelForCausalLM.from_pretrained(FULL_MODEL, torch_dtype=torch.float16).to('cuda')
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

# No gc.get_referrers() call this time, so nothing extra keeps the model alive.
del model
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())

gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())

print("\n* With quantized model *")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(FULL_MODEL, torch_dtype=torch.float16,
                                             quantization_config=quantization_config)
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

# The quantized model still sits in a reference cycle, so `del` alone does not
# free it; gc.collect() plus empty_cache() does.
del model
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())

gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())

Output:

* With unquantized model *
After loading, VRAM usage= 257405952
After model deletion, VRAM usage= 0
After gc_collect and empty_cache, VRAM usage= 0
* With quantized model *
`low_cpu_mem_usage` was None, now set to True since model is quantized.
After loading, VRAM usage= 166252544
After model deletion, VRAM usage= 166252544
After gc_collect and empty_cache, VRAM usage= 0

With the gc.get_referrers() call removed, the VRAM is released normally: immediately on del model for the unquantized model, and after gc.collect() plus torch.cuda.empty_cache() for the quantized one.