Open · yangw1234 opened this issue 10 months ago
We are observing a similar issue here with long-running SpeechT5 TTS models with some custom bells and whistles. The very same code running on a CUDA GPU is not a problem, stable as a rock. Verified with both heaptrack and memray.
@yangw1234 were you able to find a solution?
We just restart the finetuning process after CPU OOM, which, hopefully, is not very frequent.
Describe the bug
I found that the CPU memory increase happens when accelerate calls "loss.backward()" (https://github.com/huggingface/accelerate/blob/main/src/accelerate/accelerator.py#L1989) when doing LoRA finetuning on an Intel GPU Max 1100.
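The growth is easiest to see by logging the process RSS right around the backward call. Below is a minimal sketch of that kind of measurement, not the exact code I used; it assumes psutil is installed and that `accelerator` and `loss` are the objects from the training loop:

```python
import os

import psutil  # assumed available; only used to read the process RSS

_proc = psutil.Process(os.getpid())

def rss_mib() -> float:
    """Resident set size of the current process in MiB."""
    return _proc.memory_info().rss / 2**20

# Inside the training loop, around the call that accelerate forwards to
# loss.backward() (Accelerator.backward at the line linked above):
#
#     before = rss_mib()
#     accelerator.backward(loss)
#     print(f"backward: {before:.1f} -> {rss_mib():.1f} MiB")
#
# On the Intel GPU Max 1100 setup the printed RSS keeps climbing step after
# step, while reportedly the same loop on CUDA stays flat.
```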
Reproduce the memory leak using ipex and transformers.
memory trend
finetune.py
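For reference, a simplified sketch of the kind of loop used is below; it is not the attached finetune.py itself. The model name, LoRA settings, and step count are placeholders, and it assumes torch with XPU support via intel_extension_for_pytorch, plus transformers, peft, accelerate, and psutil:

```python
import os

import psutil
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the XPU backend)
from accelerate import Accelerator
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/opt-125m"  # placeholder; any causal LM shows the trend

accelerator = Accelerator()  # picks up the XPU device when available
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# One fixed batch is enough to make the host-memory growth visible.
batch = tokenizer("hello world " * 64, return_tensors="pt")
batch = {k: v.to(accelerator.device) for k, v in batch.items()}
batch["labels"] = batch["input_ids"].clone()

proc = psutil.Process(os.getpid())
for step in range(1000):
    loss = model(**batch).loss
    accelerator.backward(loss)  # CPU RSS climbs here on XPU
    optimizer.step()
    optimizer.zero_grad()
    if step % 50 == 0:
        rss = proc.memory_info().rss / 2**20
        print(f"step {step}: loss={loss.item():.3f} cpu_rss={rss:.1f} MiB")
```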
Versions
Other relevant libraries: