Thanks for reporting and providing a reproducer. I ran it on my machine and these are my logs (omitting fluff):
::::Before Loading Model, allocated GPU memory: 0
::::After Loading Model, allocated GPU memory: 6115435008
{'loss': 2.8368, 'grad_norm': nan, 'learning_rate': 0.0002, 'epoch': 0.5}
{'loss': 3.83, 'grad_norm': nan, 'learning_rate': 0.0002, 'epoch': 1.0}
{'train_runtime': 0.5141, 'train_samples_per_second': 3.89, 'train_steps_per_second': 3.89, 'train_loss': 3.3334261178970337, 'epoch': 1.0}
::::After Training Model, allocated GPU memory: 6132475392
::::After Free Memory, allocated GPU memory: 17039360
So for me, there is also some memory left after clearing the cache, but only a little, whereas for you it's basically the same as before clearing. I'm not sure what's going on here. Could you try updating to the latest versions of PEFT, transformers, trl, accelerate, and torch?
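(For reference, one way to confirm which versions are actually picked up by the running Python environment, before and after upgrading, is a quick check like the sketch below; the package list just mirrors the ones mentioned above.)

```python
import importlib.metadata as md

# Print the version of each package as seen by the current interpreter;
# raises PackageNotFoundError if a package is not installed.
for pkg in ("peft", "transformers", "trl", "accelerate", "torch"):
    print(pkg, md.version(pkg))
```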
Thanks, problem solved.
I updated these packages and found it was due to transformers (4.46.1). After updating it to 4.46.3, the problem is solved (same results as you got).
I spent half a day on it and never suspected transformers... hahaha :(
I tried a non-PEFT model before and didn't notice this problem, so I wrongly took it for a PEFT bug. Sorry~
I'm glad that this solved the issue for you.
It could be some strange interaction between PEFT and transformers that's causing it. As this is now patched, though, I don't think it's worth investigating further. I'll close the issue, but if anything new comes up, feel free to re-open.
System Info
peft 0.13.2
accelerate 1.1.0
torch 2.4.0
trl 0.12.0
python 3.10.15
linux server
Who can help?
@BenjaminBossan @sayakpaul
Information

Tasks

Reproduction
Expected behavior
I expect the code to print the following GPU memory allocation:
- Before loading the model, the allocated memory should be 0.
- After loading the model, the allocated memory should be 6115435008.
- After training the model, the allocated memory should be slightly higher than 6115435008.
- After empty_cache() and garbage collection, the allocated memory should be very low: 0~5000 (maybe?).
However, the code prints these results:
::::Before Loading Model, allocated GPU memory: 0
::::After Loading Model, allocated GPU memory: 6115435008
::::After Training Model, allocated GPU memory: 6132475392
::::After Free Memory, allocated GPU memory: 6132474368
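(Numbers like these are typically read with torch.cuda.memory_allocated(); the actual reproduction script is not shown in this thread, so the helper name below is purely illustrative.)

```python
import torch

def log_allocated(tag: str) -> None:
    # memory_allocated() reports bytes held by live tensors on the current
    # CUDA device; it does not include the allocator's cached-but-free pool.
    print(f"::::{tag}, allocated GPU memory: {torch.cuda.memory_allocated()}")

log_allocated("Before Loading Model")
```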
From the results, I see that I cannot free the memory after calling train().
I have tried several things; if I do not call train(), the GPU memory can be freed normally.
I have to do further processing after training, but this memory consumption accumulates: if I call train() n times, the allocated memory grows n times. BAD!
How can I free this memory? I even think this is a severe bug.
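(For anyone landing here on older versions: the cleanup sequence that is expected to bring allocated memory back down looks roughly like the sketch below. `trainer` and `model` are illustrative names for the objects created by the training script, so this is not runnable on its own.)

```python
import gc
import torch

# `trainer` and `model` stand in for the objects from the training script;
# every live Python reference to them has to be dropped before collecting.
del trainer, model
gc.collect()                # collect reference cycles that keep CUDA tensors alive
torch.cuda.empty_cache()    # hand the allocator's cached blocks back to the driver
print("allocated after cleanup:", torch.cuda.memory_allocated())
```

If any other object still holds a reference to the weights (as appears to have happened with transformers 4.46.1 here), gc.collect() and empty_cache() cannot release that memory, which matches the behavior reported above.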