Closed: lstein closed this issue 3 months ago
Thanks for reporting, I can replicate the issue as you described. Some further tests that I did:
- with 2 GPUs, the memory is not freed, even w/o quantization
- 4bit or 8bit makes no difference
- when simply loading AutoModelForCausalLM.from_pretrained("facebook/opt-125m"), memory is also not freed, whether with or without bnb
The last point made me wonder if the measurement is somehow incorrect. Adding some sleep time made no difference though.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I met the same problem but managed to get it to work this way: add `del model.__dict__` before `del model`.
Reproduction:
import gc
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
FULL_MODEL = "facebook/opt-125m"
print("* With unquantized model *")
model = AutoModelForCausalLM.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
).to('cuda')
print('After loading, VRAM usage=',torch.cuda.memory_allocated())
referrers = gc.get_referrers(model)
print('Referrers = ',len(referrers))
del model
print('After model deletion, VRAM usage=',torch.cuda.memory_allocated())
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=',torch.cuda.memory_allocated())
print("\n* With quantized model *")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    FULL_MODEL,
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
)
print('After loading, VRAM usage=',torch.cuda.memory_allocated())
referrers = gc.get_referrers(model)
print('Referrers = ',len(referrers))
del model.__dict__ # <--- add this before `del model`
del model
print('After model deletion, VRAM usage=',torch.cuda.memory_allocated())
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=',torch.cuda.memory_allocated())
Output:
* With unquantized model *
After loading, VRAM usage= 257405952
Referrers = 1
After model deletion, VRAM usage= 0
After gc_collect and empty_cache, VRAM usage= 0
* With quantized model *
`low_cpu_mem_usage` was None, now set to True since model is quantized.
After loading, VRAM usage= 166252544
Referrers = 7
After model deletion, VRAM usage= 166252544
After gc_collect and empty_cache, VRAM usage= 0
Explanation: the root cause is a circular reference.
- Without quantization: Referrers = 1
- With load_in_8bit: Referrers = 7
By `del model.__dict__`, the reference cycle is cut, so `gc.collect()` works.
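If this workaround works for your setup, the cleanup steps can be bundled into a small helper. This is only a sketch based on the reproduction above: the name `free_model` is made up, and `model.__dict__.clear()` is used in place of `del model.__dict__` (it drops the same attribute references):
import gc
import torch

def free_model(model) -> None:
    """Release a (possibly quantized) model's VRAM (sketch, not a transformers API).

    Clearing model.__dict__ drops the model's references to its parameters and
    submodules, breaking the reference cycle that otherwise keeps the 8-bit
    weights alive after `del model`; gc.collect() can then reclaim them.
    Note: `del model` here only removes the local name, so the caller should
    drop its own reference as well.
    """
    model.__dict__.clear()    # same effect as `del model.__dict__` above
    del model                 # drop the local reference
    gc.collect()              # collect the now-unreferenced cycle
    torch.cuda.empty_cache()  # release cached blocks back to the driver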
> Thanks for reporting, I can replicate the issue as you described. Some further tests that I did:
> - with 2 GPUs, the memory is not freed, even w/o quantization
> - 4bit or 8bit makes no difference
> - when simply loading AutoModelForCausalLM.from_pretrained("facebook/opt-125m"), memory is also not freed, whether with or without bnb
> The last point made me wonder if the measurement is somehow incorrect. Adding some sleep time made no difference though.
This is probably due to the circular reference introduced by `referrers = gc.get_referrers(model)`.
Reproduction (with the `referrers` lines removed):
import gc
import torch
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
FULL_MODEL = "facebook/opt-125m"
print("* With unquantized model *")
model = AutoModelForCausalLM.from_pretrained(FULL_MODEL, torch_dtype=torch.float16).to('cuda')
print('After loading, VRAM usage=', torch.cuda.memory_allocated())
del model
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
print("\n* With quantized model *")
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    FULL_MODEL, torch_dtype=torch.float16, quantization_config=quantization_config
)
print('After loading, VRAM usage=', torch.cuda.memory_allocated())
del model
print('After model deletion, VRAM usage=', torch.cuda.memory_allocated())
gc.collect()
torch.cuda.empty_cache()
print('After gc_collect and empty_cache, VRAM usage=', torch.cuda.memory_allocated())
Output:
* With unquantized model *
After loading, VRAM usage= 257405952
After model deletion, VRAM usage= 0
After gc_collect and empty_cache, VRAM usage= 0
* With quantized model *
`low_cpu_mem_usage` was None, now set to True since model is quantized.
After loading, VRAM usage= 166252544
After model deletion, VRAM usage= 166252544
After gc_collect and empty_cache, VRAM usage= 0
With the `gc.get_referrers` call removed, the VRAM is released normally after `gc.collect()` and `torch.cuda.empty_cache()`.
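If you still want to count referrers while debugging, one option (not from the original report; `count_referrers` is a hypothetical helper) is to drop the list returned by `gc.get_referrers` immediately, so it does not keep anything reachable:
import gc

def count_referrers(obj) -> int:
    """Count the objects that refer to `obj` without keeping them alive.

    gc.get_referrers() returns the objects holding references to `obj`.
    Storing that list in a long-lived variable (as in the first script above)
    keeps those referrers reachable, which can prevent the reference cycle
    from being collected later. Returning only the count avoids that.
    """
    referrers = gc.get_referrers(obj)
    count = len(referrers)
    del referrers  # drop the list before returning so nothing extra stays alive
    return count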
System Info
Information
Tasks
Reproduction
I am part of the InvokeAI development team (www.invoke.ai), and I am trying to provide support for the Stable Diffusion 3 text2image model. This task requires me to be able to sequentially load and unload portions of the generation pipeline into VRAM on CUDA systems.
After quantizing the HuggingFace `T5EncoderModel` using `load_in_8bit`, I cannot remove the model from VRAM. This appears to be related to the issue reported at https://github.com/huggingface/transformers/issues/21094; however, none of the proposed solutions are working for me. The following script illustrates the issue:
Expected behavior
The output is:
The expected output is for the last line to read VRAM usage=0. In fact, when I comment out the `quantization_config` parameter, the VRAM is indeed released.