Closed: JamesBowerXanda closed this issue 5 months ago.
Hi @JamesBowerXanda, thanks for raising this issue and providing a script and the environment info.
Could you provide some more information about the memory consumption? Ideally we'd see some kind of graph of it changing over time.
cc @muellerzr @pacman100 @ylacombe @sanchit-gandhi
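For reference, a minimal sketch of a `TrainerCallback` that logs resident memory per step and could produce that kind of graph — the use of psutil and the CSV filename are assumptions, not something from this thread:

```python
import csv

import psutil
from transformers import TrainerCallback


class MemoryLoggerCallback(TrainerCallback):
    """Log the process's resident set size (RSS) after every training step."""

    def __init__(self, path="memory_log.csv"):
        self.path = path
        with open(self.path, "w", newline="") as f:
            csv.writer(f).writerow(["step", "rss_bytes"])

    def on_step_end(self, args, state, control, **kwargs):
        rss = psutil.Process().memory_info().rss
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([state.global_step, rss])
```

Passing `callbacks=[MemoryLoggerCallback()]` to the trainer and plotting the resulting CSV would show whether memory grows step over step.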
Hi, I may have been a bit hasty in raising this. I was using Activity Monitor on the Mac to check the memory usage, and while it has gone up to 73GB for the process, the script does seem to still be running. There is only 32GB of physical memory on the machine, so it might just be that I am misunderstanding something in Activity Monitor, or there is something strange going on in how the process's memory consumption is calculated.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@JamesBowerXanda Do you still have the problem? Did you figure out why? cc: @amyeroberts
Please check the screenshot above. I am using https://github.com/huggingface/alignment-handbook to train an LLM, and what I observe from the wandb log is that system memory usage keeps increasing; when it reaches roughly 96%, the training crashes.
@xiyang-aads-lilly I thought it was Activity Monitor bugging out, but it turned out it wasn't. I actually opened another issue here, but I still haven't gotten to the bottom of it.
Thanks for the reply and for pointing me to the latest issue! I saw the suggestion about torch_empty_cache_steps; I will give that a try first.
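A minimal sketch of what that looks like — note that `torch_empty_cache_steps` was added to `TrainingArguments` in a transformers release newer than the 4.39.3 reported in this issue, so upgrading is assumed, and the value here is illustrative:

```python
from transformers import Seq2SeqTrainingArguments

# Requires a transformers version that supports torch_empty_cache_steps
# (newer than the 4.39.3 in this report); 10 is an arbitrary example value.
training_args = Seq2SeqTrainingArguments(
    output_dir="out",
    torch_empty_cache_steps=10,  # release cached backend memory every 10 steps
)
```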
@xiyang-aads-lilly I don't see how it could hurt to add it in as well. I am completely stumped on what to do about it. I get the impression that it will end up being attributed to a lower-level problem with PyTorch on MPS, though, so it won't be fixed through this forum. The issue is that we are not sure which part of the Trainer is causing it, so it is hard to raise an issue on torch.
System Info

transformers version: 4.39.3

Who can help?

No response

Information

Tasks

examples folder (such as GLUE/SQuAD, ...)

Reproduction
I am trying to fine-tune the SpeechT5ForTextToSpeech model on the "lj_speech" dataset, using the Seq2SeqTrainer class. My configuration is:
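The configuration block itself is not preserved in this thread. As a stand-in, here is a minimal sketch of the kind of `Seq2SeqTrainingArguments` setup described, with every value an illustrative assumption rather than the reporter's actual setting:

```python
from transformers import Seq2SeqTrainingArguments

# All values below are illustrative assumptions, not the original configuration.
training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts_ljspeech",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,
    warmup_steps=500,
    max_steps=4000,
    gradient_checkpointing=True,
    save_steps=1000,
    logging_steps=25,
    remove_unused_columns=False,  # TTS collators typically need the raw columns
)
```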
For some reason the memory consumption is constantly increasing throughout the training run. It starts at 27GB for the first few steps and by step 250 it has hit 49.16GB. No evaluations have been run up to that point. My understanding is that the memory footprint should not keep increasing after each step. Could anyone explain why this is happening?
Below is a full copy of the script:
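The script likewise is not preserved; the sketch below shows the general shape such a script takes. The `prepare_example` preprocessing is a hypothetical stand-in, `TTSDataCollatorWithPadding` refers to the padding collator defined in the Hugging Face SpeechT5 fine-tuning guide (definition not shown here), and speaker embeddings are omitted for brevity:

```python
from datasets import load_dataset
from transformers import (
    Seq2SeqTrainer,
    SpeechT5ForTextToSpeech,
    SpeechT5Processor,
)

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
model = SpeechT5ForTextToSpeech.from_pretrained("microsoft/speecht5_tts")

dataset = load_dataset("lj_speech", split="train")


def prepare_example(batch):
    # Tokenize the transcript and convert the waveform into the
    # log-mel spectrogram target that SpeechT5 trains against.
    example = processor(
        text=batch["normalized_text"],
        audio_target=batch["audio"]["array"],
        sampling_rate=batch["audio"]["sampling_rate"],
    )
    example["labels"] = example["labels"][0]  # drop the batch dimension
    return example


dataset = dataset.map(prepare_example, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,  # the Seq2SeqTrainingArguments shown above
    train_dataset=dataset,
    # Hypothetical helper, as in the SpeechT5 fine-tuning guide:
    data_collator=TTSDataCollatorWithPadding(processor=processor),
    tokenizer=processor,
)
trainer.train()
```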
Expected behavior
Memory consumption to be approximately constant during the training process.