I tried adding

```python
import gc
import torch as th

del training_args
del trainer
del model_config
del run
gc.collect()
th.cuda.empty_cache()
```

to the end of each loop iteration, but it does not seem to change anything.
I think the memory problem comes from the wandb integration. I do not see the problem without it: memory resets to 0 at each new iteration of the loop and goes back to the same max value.
Use torch.no_grad() inside the for loop.
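For what it's worth, here is a minimal sketch of what I take that suggestion to mean, assuming the evaluation pass of each run is the part being wrapped; `model` and `eval_dataloader` are placeholders, and this only helps if the leak comes from retained autograd graphs:

```python
import torch

# Sketch only: run the per-iteration evaluation without building a
# computation graph, so activations are not kept alive between runs.
model.eval()
with torch.no_grad():
    for batch in eval_dataloader:
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = model(**batch)
```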
Seems like the same problem occurs with wandb's sweeps, so it looks like a wandb problem more than a huggingface one. I can't use wandb then, sucks :/
cc @borisdayma so you are aware.
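One thing that may be worth trying (an untested guess on my part, not a confirmed fix): end each run explicitly so wandb's background process is shut down before the next iteration starts. `train_one_run` here is a hypothetical stand-in for the Trainer setup and training code:

```python
import wandb

for config in configs:
    run = wandb.init(project="my-project", config=config, reinit=True)
    train_one_run(config)  # hypothetical: builds the Trainer and calls train()
    run.finish()  # flush logs and stop this run before starting the next one
```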
The problem is that GPU memory allocated accumulates for each run. This eventually results in a

```
RuntimeError: CUDA out of memory
```

error. You can see the wandb GPU memory allocated, produced by the code below, here: wandb. I had the same problem when using Trainer's built-in `hyperparameter_search` (see the sketch below), which I assume also runs training in a loop. Similar issues from the past:

- https://github.com/huggingface/transformers/issues/1742
- https://github.com/huggingface/transformers/issues/1134
- https://gitmemory.com/issue/huggingface/transformers/9929/770965726
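For context, this is roughly how I was calling `hyperparameter_search` (a sketch, not my exact code; `training_args`, `train_dataset`, and `eval_dataset` are placeholders, and a search backend such as optuna is assumed to be installed):

```python
from transformers import BertForSequenceClassification, Trainer

def model_init():
    # A fresh model is created for every trial, so each trial is its own "run".
    return BertForSequenceClassification.from_pretrained("bert-base-cased")

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
best_trial = trainer.hyperparameter_search(direction="minimize", n_trials=10)
```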
Environment info

transformers version: 4.4.2

Who can help
Library:
Information
Model I am using (Bert, XLNet ...):
BertForSequenceClassification.from_pretrained('bert-base-cased')
The problem arises when using:
The task I am working on is:
To reproduce
Steps to reproduce the behavior:
This code reproduces the error.
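The original snippet is not reproduced here; the following is a minimal sketch of the kind of loop described, assuming repeated fine-tuning runs with the wandb integration enabled. `train_dataset`, `eval_dataset`, and the argument values are placeholders:

```python
import wandb
from transformers import (
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

# Sketch of the repro: each iteration builds a fresh model and Trainer,
# trains with wandb logging enabled, and GPU memory allocated grows
# run over run instead of resetting.
for seed in range(5):
    run = wandb.init(project="memory-leak-repro", reinit=True)
    model = BertForSequenceClassification.from_pretrained("bert-base-cased")
    training_args = TrainingArguments(
        output_dir="out",
        num_train_epochs=1,
        report_to=["wandb"],
        seed=seed,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
    )
    trainer.train()
    run.finish()
```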
Expected behavior
The loop runs without memory accumulating for each run.