JingxinLee opened 2 years ago
Hi @JingxinLee, yes this chapter is a bit of a beast because we train so many XLM-R models in it. I don't think there exists an elegant solution beyond restarting the notebook, but what we can add is the intermediate checkpoints so that you can skip ahead to the training sections you're interested in.
One other thing you can try is reducing the batch_size in the TrainingArguments.
The first time I tried to run the whole notebook on Kaggle with a P100 (16GB), it ran out of CUDA memory partway through. But after reducing the batch_size from 24 to 16, I was able to run it end to end.
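If you reduce the batch_size to fit in memory, you can keep the effective batch size (and so the optimizer dynamics) roughly the same by accumulating gradients over several steps. A minimal sketch of picking the accumulation factor; the helper name is hypothetical, not part of the chapter or the transformers API:

```python
def accumulation_steps(target_batch: int, per_device_batch: int) -> int:
    """Smallest number of gradient-accumulation steps such that
    per_device_batch * steps is at least target_batch."""
    if per_device_batch <= 0:
        raise ValueError("per_device_batch must be positive")
    return -(-target_batch // per_device_batch)  # ceiling division

# e.g. the chapter's batch_size of 24, squeezed to 16 per device:
steps = accumulation_steps(24, 16)  # 2 steps -> effective batch of 32
# Then pass per_device_train_batch_size=16 and
# gradient_accumulation_steps=steps to TrainingArguments.
```

Note the effective batch overshoots slightly (32 instead of 24) because accumulation steps must be whole numbers; you can also lower per_device_batch further until the product matches exactly.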
Information
The problem arises in chapter:
Describe the bug
RuntimeError: CUDA out of memory.
To Reproduce
Steps to reproduce the behavior:
So do you have a solution for dealing with the CUDA OOM problem in a Jupyter notebook?
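Short of restarting the kernel, one thing worth trying between training runs in a notebook is dropping the large objects and releasing PyTorch's cached allocator blocks. A hedged sketch, where `model` and `trainer` stand in for whatever objects the notebook actually created (here they are dummy placeholders so the snippet runs anywhere):

```python
import gc

# Stand-ins for the real model/trainer objects created earlier in the notebook.
model, trainer = object(), object()

# Drop the Python references so the tensors become garbage-collectable.
del model, trainer
collected = gc.collect()  # number of objects the collector reclaimed

# Release cached CUDA memory back to the driver, if torch and a GPU exist.
try:
    import torch
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
except ImportError:
    pass  # CPU-only environment; nothing to release
```

This often frees enough memory for the next training section, though fragmentation can still force a kernel restart in the worst case.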