nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0
3.87k stars 1.21k forks

CUDA out of memory #26

Open JingxinLee opened 2 years ago

JingxinLee commented 2 years ago

Information

The problem arises in chapter:

Describe the bug

RuntimeError: CUDA out of memory.

To Reproduce

Steps to reproduce the behavior:

  1. Run the 04_multilingual-ner.ipynb notebook in the free Colab tier or in a local Jupyter Notebook (both with an 11GB GPU).
  2. The CUDA memory is not freed after the first trainer.train() finishes.
  3. Calling torch.cuda.empty_cache() doesn't help. Killing the process via nvidia-smi also kills the notebook, so I have to re-run it from the start.

So, is there a way to deal with the CUDA OOM problem in a Jupyter notebook?
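For reference, the usual workaround between training runs is to drop every Python reference to the model and trainer, run the garbage collector, and only then call torch.cuda.empty_cache() (which can only release blocks no tensor still holds). A minimal sketch, using a small torch.nn.Linear as a stand-in for the XLM-R model from the notebook:

```python
import gc
import torch

def allocated_bytes():
    """Bytes of GPU memory currently held by tensors (0 on CPU-only machines)."""
    return torch.cuda.memory_allocated() if torch.cuda.is_available() else 0

# Stand-in for the fine-tuned model; in the notebook this would be the
# XLM-R model and its Trainer.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(1024, 1024).to(device)

before = allocated_bytes()

# The key steps, in order: drop the references, collect any reference
# cycles still pinning tensors, then release PyTorch's cached blocks
# back to the driver so nvidia-smi reflects the change.
del model
gc.collect()
if torch.cuda.is_available():
    torch.cuda.empty_cache()

after = allocated_bytes()
print(after <= before)
```

Note that empty_cache() alone does nothing if the notebook still holds a reference to the model, the trainer, or an output tensor; that is usually why it appears "not to work".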

lewtun commented 2 years ago

Hi @JingxinLee, yes this chapter is a bit of a beast because we train so many XLM-R models in it. I don't think there exists an elegant solution beyond restarting the notebook, but what we can add is the intermediate checkpoints so that you can skip ahead to the training sections you're interested in.

EdwardJRoss commented 2 years ago

One other thing you can try is reducing the batch_size in the TrainingArguments.

The first time I tried to run the whole notebook in Kaggle with a P100 (16GB), it ran out of CUDA memory partway through. After reducing the batch_size from 24 to 16, I was able to run it end to end.
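Activation memory scales roughly linearly with batch size, so this is usually the cheapest fix. If you are worried about changing the effective batch size, you can pair a smaller micro-batch with gradient accumulation (in TrainingArguments this is the gradient_accumulation_steps parameter). A minimal sketch of the idea in plain PyTorch, with a hypothetical tiny model standing in for XLM-R:

```python
import torch

# Hypothetical tiny model and data standing in for XLM-R + the NER dataset.
model = torch.nn.Linear(8, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

micro_batch = 8   # small enough to fit in GPU memory (instead of 24)
accum_steps = 3   # 8 * 3 = effective batch of 24, same as the notebook
x = torch.randn(24, 8)
y = torch.randint(0, 2, (24,))

opt.zero_grad()
for step in range(accum_steps):
    xb = x[step * micro_batch:(step + 1) * micro_batch]
    yb = y[step * micro_batch:(step + 1) * micro_batch]
    # Scale the loss so the accumulated gradient matches a full 24-example batch.
    loss = loss_fn(model(xb), yb) / accum_steps
    loss.backward()   # gradients accumulate across micro-batches
opt.step()            # one optimizer step per effective batch
```

Trainer does exactly this bookkeeping for you when you set per_device_train_batch_size and gradient_accumulation_steps in TrainingArguments.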