Skit5 opened this issue 1 month ago
Hey! This seems more related to accelerate, pinging @SunMarc and @muellerzr!
Hey @Skit5, thanks for reporting! From the traceback, it is GPU 1 that goes OOM. Could you try changing a few params in TRANSFORMERS_PARAMS to see if it behaves better? I would suggest changing device_map to "sequential" and tweaking the value of max_memory={0: '24200MiB', 1: '24200MiB'} so that you don't experience OOM. We will try to change how device_map='auto' behaves with bnb so that there is no such unbalanced allocation.
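A minimal sketch of the suggestion above, for anyone following along. The helper name, the headroom value, and the GPU totals are assumptions for illustration, not measured figures; the idea is simply to pass an explicit per-GPU budget via max_memory together with device_map="sequential" so some VRAM stays free for activations:

```python
# Sketch (assumed numbers): build the max_memory dict that
# transformers' from_pretrained(max_memory=...) accepts, leaving
# headroom on each GPU for activations and the KV cache.

def make_max_memory(totals_mib, headroom_mib=2000):
    """Map GPU index -> "NMiB" budget string."""
    return {i: f"{total - headroom_mib}MiB" for i, total in enumerate(totals_mib)}

# Two ~24 GiB cards; lower the budgets further if GPU 1 still goes OOM:
budgets = make_max_memory([24200, 24200])
print(budgets)  # {0: '22200MiB', 1: '22200MiB'}
```

The resulting dict would then be passed as `AutoModelForCausalLM.from_pretrained(..., device_map="sequential", max_memory=budgets)`; how much headroom is actually needed depends on sequence length and batch size, so some trial and error is expected.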
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
transformers version: 4.43.0.dev0
Who can help?
text models: @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
It occurs when using HF Transformers with Text-Generation WebUI (ticket opened here: https://github.com/oobabooga/text-generation-webui/issues/6028), but it was also reproduced with a simple loading test in a Jupyter Notebook using the base conda environment, so I am opening a ticket here since it is probably an HF Transformers issue.
The problem happens when a large model (Llama-3-70B, 4-bit quantized) is loaded across dual GPUs (RTX 3090 + RTX 3090 Ti): a large part of the VRAM is left unallocated (GPU 0: 94% used, GPU 1: 66% used). This causes a torch.cuda.OutOfMemoryError for a modest length of text.
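To make the imbalance concrete, here is a tiny sketch of the reported split; the per-GPU MiB figures are approximations reconstructed from the percentages above, not fresh measurements:

```python
# Sketch: percentage of VRAM allocated per GPU; MiB figures are approximate
# reconstructions of the 94% / 66% split reported above.

def utilization_pct(allocated_mib, total_mib):
    return round(100 * allocated_mib / total_mib)

print(utilization_pct(22800, 24250))  # GPU 0 -> 94
print(utilization_pct(16000, 24250))  # GPU 1 -> 66
```

With device_map="auto" both cards should end up near the same utilization, which is why the 94/66 split looks like an allocation bug rather than a genuine capacity limit.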
Expected behavior
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 1 has a total capacity of 23.68 GiB of which 77.69 MiB is free. Including non-PyTorch memory, this process has 22.66 GiB memory in use. Of the allocated memory 22.28 GiB is allocated by PyTorch, and 70.71 MiB is reserved by PyTorch but unallocated.
There are two issues with this: