Closed: liyucheng09 closed this issue 3 years ago
I'm quite puzzled too, to be honest. I know that PyTorch will sometimes trigger a CUDA OOM error even when there is enough memory in theory, just because it cannot find a contiguous chunk or has some leftovers for some reason, exactly as your message suggests (22.53GB allocated but 23.21GB reserved by PyTorch). I don't have any suggestion apart from the usual strategies to lower the memory footprint a bit (slightly lower the batch size or block size).
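As a side note for anyone debugging this: the allocated-vs-reserved gap in the error message comes from PyTorch's caching allocator, and you can query it directly. The snippet below is only a generic diagnostic sketch (the helper name and call sites are arbitrary), not something run_clm.py provides:

```python
import torch

def report_cuda_memory(tag: str = "") -> None:
    """Print memory used by live tensors vs. memory held by PyTorch's caching allocator."""
    allocated = torch.cuda.memory_allocated() / 2**30  # GiB actually used by tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # GiB reserved by the caching allocator
    print(f"[{tag}] allocated: {allocated:.2f} GiB | reserved: {reserved:.2f} GiB")

report_cuda_memory("before eval")
torch.cuda.empty_cache()   # release unused cached blocks back to the driver
report_cuda_memory("after empty_cache")
```

A large and growing gap between the two numbers usually points at fragmentation or tensors being kept alive longer than expected.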
@sgugger Appreciate your reply! I am wondering whether I can resume training if I change the batch size or block size in the training args. I have no idea whether the new settings will be compatible with the saved scheduler and optimizer state.
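For what it's worth, resuming with changed arguments is possible in principle: Trainer.train can be pointed at the last checkpoint and will restore the optimizer, scheduler, and trainer state from that directory, while the batch size and block size come from the new arguments. Since the learning-rate schedule is step-based, changing the batch size changes the number of steps per epoch, so the resumed schedule will likely no longer line up exactly with the original run. A rough sketch (the checkpoint path is a placeholder; the dataset and collator are assumed to be built as in run_clm.py):

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder; use your own model/config

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,   # new, smaller batch size for the resumed run
    num_train_epochs=20,
)

# train_dataset / data_collator are assumed to be built the same way run_clm.py builds them.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset, data_collator=data_collator)

# Restores optimizer, scheduler, and trainer state from the checkpoint directory.
# (Older transformers versions around 4.1.x used trainer.train(model_path=...) instead.)
trainer.train(resume_from_checkpoint="output/checkpoint-50000")
```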
Hi, have you solved this problem yet?
@xinjicong Not yet. If you have any ideas, please share.
I tried making max_seq_length smaller, but it didn't work.
I checked my code and found the problem was in how I was using the tokenizer. The tokenizer output had an extra dimension, which then caused an error when batching.
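That extra dimension is typically what you get when each example is tokenized with return_tensors="pt": the tokenizer already returns a [1, seq_len] tensor, and stacking those in a collator produces [batch_size, 1, seq_len]. A guess at what went wrong, as a minimal sketch rather than the poster's actual code:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

# Tokenizing a single example with return_tensors="pt" already adds a batch dimension:
enc = tokenizer("这是一个例子", return_tensors="pt")
print(enc["input_ids"].shape)   # torch.Size([1, seq_len]) -- the extra dimension

# When a data collator later stacks such outputs, batches become
# [batch_size, 1, seq_len] instead of [batch_size, seq_len] and the model errors out.

# One fix: return plain lists per example and let the collator build the tensors.
enc = tokenizer("这是一个例子")
print(len(enc["input_ids"]))    # flat list of token ids, no extra dimension
```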
I observe the same issue: if I train a model, save a checkpoint, and reload from it, I get memory issues in code that was training fine before.
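One difference between a fresh run and a resumed one is where the checkpoint is deserialized: by default, torch.load puts tensors back on the device they were saved from, so the optimizer state can land on the GPU on top of the freshly loaded model weights. If you reload manually, mapping to the CPU first keeps the peak GPU usage lower. A hedged sketch (paths and hyperparameters are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM

# Model weights: from_pretrained deserializes on the CPU, then the model is moved once.
model = AutoModelForCausalLM.from_pretrained("output/checkpoint-50000").to("cuda")

# Optimizer state: torch.load would otherwise restore these tensors directly onto the GPU,
# briefly holding an extra copy there; mapping to the CPU avoids that spike.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
optimizer.load_state_dict(
    torch.load("output/checkpoint-50000/optimizer.pt", map_location="cpu")
)
```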
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Same Issue
+1
I have this issue as well. Model trains for 1 epoch and goes through validation step, then I get OOM somewhere in the second epoch. These are large models I am training and I often get OOM after it has been training for a couple of hours.
@dinsausti-vir Try reducing the validation batch size to 1. I'm not sure exactly how I fixed the error, but batch size is usually the cause of OOM.
@perceptiveshawty Thanks for the tip. I will give that a shot!
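In Trainer terms, the suggestion above corresponds to per_device_eval_batch_size; eval_accumulation_steps (an extra knob not mentioned in the thread) can also help by moving accumulated predictions to the CPU during evaluation instead of keeping them all on the GPU. A sketch with placeholder values:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=4,   # placeholder
    per_device_eval_batch_size=1,    # shrink evaluation batches, as suggested above
    eval_accumulation_steps=16,      # move accumulated eval predictions to the CPU every 16 steps
)
```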
Environment info
transformers version: 4.1.1
Who can help
Information
Model I am using (Bert, XLNet ...): GPT2
The problem arises when using:
The task I am working on is:
To reproduce
The strange thing is that the script runs fine for the first 12 epochs and then ends with this error in the middle of an epoch. I have checked that the Trainer does not cache the training loss tensor, so I am quite puzzled by the error. Any help is highly appreciated.
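For context, "caching the training loss tensor" refers to the classic PyTorch leak where a loop accumulates the loss tensor itself, which keeps its autograd history alive, instead of a detached Python number. A self-contained toy sketch of the pattern (not run_clm.py's actual loop):

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = [torch.randn(8, 10) for _ in range(100)]

total_loss = 0.0
for x in data:
    optimizer.zero_grad()
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

    # Leak: `total_loss += loss` accumulates autograd history, so memory grows every step.
    # Safe: convert to a plain Python float before accumulating.
    total_loss += loss.item()
```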
Steps to reproduce the behavior:
python run_clm.py config.json
Several useful settings in config.json are:

The model config is:
The tokenizer used is `BertTokenizer.from_pretrained('Bert-base-chinese')`. The error log is as follows: