huggingface / accelerate


Cannot free VRAM after loading a quantized model #2871

Open lstein opened 1 week ago

lstein commented 1 week ago

System Info

- `Accelerate` version: 0.31.0
- Platform: Linux-5.15.0-79-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /home/lstein/test_ckpts/SD3/.venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.0.0
- PyTorch version (GPU?): 2.3.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 62.57 GB
- GPU type: NVIDIA GeForce RTX 4070
- `Accelerate` default config:
        Not found

Information

Tasks

Reproduction

I am part of the InvokeAI development team (www.invoke.ai), and I am trying to add support for the Stable Diffusion 3 text2image model. This requires being able to sequentially load portions of the generation pipeline into VRAM and unload them again on CUDA systems.
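
For context, this is roughly the per-submodel load/run/offload pattern we need; a minimal sketch with placeholder names, not InvokeAI's actual code:

import torch
from transformers import T5EncoderModel

# Load one piece of the pipeline (here, the SD3 T5 text encoder) onto the CPU first.
text_encoder = T5EncoderModel.from_pretrained(
    'stabilityai/stable-diffusion-3-medium-diffusers',
    subfolder='text_encoder_3',
    torch_dtype=torch.float16,
)

# Move it into VRAM only while it is needed...
text_encoder.to('cuda')
# ... encode the prompt here ...

# ...then move it back to system RAM and reclaim the VRAM for the next submodel.
text_encoder.to('cpu')
torch.cuda.empty_cache()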

After quantizing the HuggingFace T5EncoderModel with load_in_8bit, I cannot remove the model from VRAM. This appears to be related to the issue reported at https://github.com/huggingface/transformers/issues/21094; however, none of the solutions proposed there work for me. The following script illustrates the problem:

import gc
import torch
from transformers import T5EncoderModel, BitsAndBytesConfig
from accelerate.utils import release_memory

FULL_MODEL = 'stabilityai/stable-diffusion-3-medium-diffusers'

print("\n* With quantized model *")

# Load the SD3 T5 text encoder, quantized to 8-bit via bitsandbytes.
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = T5EncoderModel.from_pretrained(FULL_MODEL,
                                       torch_dtype=torch.float16,
                                       subfolder='text_encoder_3',
                                       quantization_config=quantization_config,
                                       low_cpu_mem_usage=True,
                                       device_map='auto',
                                       )
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

# Count the live referrers to the model object before trying to free it.
referrers = gc.get_referrers(model)
print('Referrers = ', len(referrers))

# Try to free the model: accelerate's release_memory(), then drop our own reference.
release_memory(model)
model = None
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())

# Force a garbage-collection pass and return cached blocks to the CUDA driver.
gc.collect()
torch.cuda.empty_cache()

print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())

Expected behavior

The output is:

* With quantized model *
After loading, VRAM usage= 7918596096
Referrers =  7
After model deletion, VRAM usage= 7918596096
After gc_collect and empty_cache, VRAM usage= 7918596096

The expected output is for the last line to read VRAM usage= 0. Indeed, when I comment out the quantization_config parameter, the VRAM is released as expected.
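
For reference, the unquantized control case mentioned above (the same load with quantization_config left out) does release the memory; a minimal sketch trimmed from the script above:

import gc
import torch
from transformers import T5EncoderModel

model = T5EncoderModel.from_pretrained('stabilityai/stable-diffusion-3-medium-diffusers',
                                       torch_dtype=torch.float16,
                                       subfolder='text_encoder_3',
                                       low_cpu_mem_usage=True,
                                       device_map='auto',
                                       )
print('After loading, VRAM usage=', torch.cuda.memory_allocated())

# Without bitsandbytes quantization, dropping the reference and collecting is enough;
# the final readout goes back to 0 here.
model = None
gc.collect()
torch.cuda.empty_cache()
print('After deletion, VRAM usage=', torch.cuda.memory_allocated())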

BenjaminBossan commented 1 week ago

Thanks for reporting; I can replicate the issue as you described. Some further tests that I did:

The last point made me wonder whether the measurement is somehow incorrect. Adding some sleep time made no difference, though.
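
One way to double-check the measurement is to compare the allocator-level numbers from torch against the driver-level usage that nvidia-smi reports; a minimal sketch, assuming pynvml is installed (report_vram is a hypothetical helper for illustration, not one of the tests above):

import torch
import pynvml

def report_vram(label: str) -> None:
    # Allocator view: bytes held by live tensors vs. bytes reserved (cached) from the driver.
    allocated = torch.cuda.memory_allocated()
    reserved = torch.cuda.memory_reserved()
    # Driver view: total memory in use on device 0, as nvidia-smi would show it.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
    pynvml.nvmlShutdown()
    print(f'{label}: allocated={allocated} reserved={reserved} nvml_used={used}')

# e.g. call report_vram('after release') at each step of the reproduction script.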