nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0
3.87k stars 1.21k forks

CUDA out of memory #26

Open JingxinLee opened 2 years ago

JingxinLee commented 2 years ago

Information

The problem arises in chapter:

Describe the bug

RuntimeError: CUDA out of memory.

To Reproduce

Steps to reproduce the behavior:

  1. Run the 04_multilingual-ner.ipynb notebook in the free Colab tier or in a local Jupyter Notebook (both with an 11GB GPU).
  2. The CUDA memory is not freed after the first trainer.train() finishes.
  3. Calling torch.cuda.empty_cache() doesn't help. Killing the process via nvidia-smi also kills the notebook, so I have to re-run it from the start.

So, is there a way to deal with the CUDA OOM problem in a Jupyter notebook?
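For reference, the usual workaround between training runs is to drop every Python reference to the model and trainer, run the garbage collector, and only then call torch.cuda.empty_cache() (which can only release blocks no tensor still holds). A minimal sketch, using a small torch.nn.Linear as a stand-in for the XLM-R model from the notebook:

```python
import gc
import torch

def allocated_bytes():
    """Bytes of GPU memory currently held by tensors (0 on CPU-only machines)."""
    return torch.cuda.memory_allocated() if torch.cuda.is_available() else 0

# Stand-in for the fine-tuned model; in the notebook this would be the
# XLM-R model and its Trainer.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)

before = allocated_bytes()

# The key steps, in order: drop the references, collect any reference
# cycles still pinning tensors, then release PyTorch's cached blocks
# back to the driver so nvidia-smi reflects the change.
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

after = allocated_bytes()
print(after <= before)
```

Note that empty_cache() alone does nothing if the notebook still holds a reference to the model, the trainer, or an output tensor; that is usually why it appears "not to work".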

lewtun commented 2 years ago

Hi @JingxinLee, yes this chapter is a bit of a beast because we train so many XLM-R models in it. I don't think there exists an elegant solution beyond restarting the notebook, but what we can add is the intermediate checkpoints so that you can skip ahead to the training sections you're interested in.

EdwardJRoss commented 2 years ago

One other thing you can try is reducing the batch_size in the TrainingArguments.

The first time I tried to run the whole notebook in Kaggle with a P100 (16GB), it ran out of CUDA memory partway through. After reducing the batch_size from 24 to 16, I was able to run it end to end.
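Activation memory scales roughly linearly with batch size, so this is usually the cheapest fix. If you are worried about changing the effective batch size, you can pair a smaller micro-batch with gradient accumulation (in TrainingArguments this is the gradient_accumulation_steps parameter). A minimal sketch of the idea in plain PyTorch, with a hypothetical tiny model standing in for XLM-R:

```python
import torch

# Hypothetical tiny model and data standing in for XLM-R + the NER dataset.
model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

micro_batch = 8   # small enough to fit in GPU memory (instead of 24)
accum_steps = 3   # 8 * 3 = effective batch of 24, same as the notebook
x = torch.randn(24, 8)
y = torch.randint(0, 2, (24,))

opt.zero_grad()
for step in range(accum_steps):
    xb = x[step * micro_batch:(step + 1) * micro_batch]
    yb = y[step * micro_batch:(step + 1) * micro_batch]
    # Scale the loss so the accumulated gradient matches a full 24-example batch.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()   # gradients accumulate across micro-batches
opt.step()            # one optimizer step per effective batch
```

Trainer does exactly this bookkeeping for you when you set per_device_train_batch_size and gradient_accumulation_steps in TrainingArguments.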