nlp-with-transformers / notebooks

Jupyter notebooks for the Natural Language Processing with Transformers book
https://transformersbook.com/
Apache License 2.0

Ch 02: RuntimeError: CUDA out of memory with P100 #30

Open MarcusFra opened 2 years ago

MarcusFra commented 2 years ago

Information

The problem arises in chapter: 2 (02_classification.ipynb)

Describe the bug

When running 02_classification.ipynb on Kaggle with a P100 GPU, I receive a RuntimeError: CUDA out of memory after running cell 58:

#hide_output
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)
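(For context, extract_hidden_states is defined earlier in the notebook; roughly, it moves each tokenized batch to the GPU, runs the model under torch.no_grad(), and returns the [CLS] hidden state. A sketch from memory — details may differ slightly from the notebook, and it assumes model, tokenizer, and device are already defined:)

def extract_hidden_states(batch):
    # Place the tokenized inputs on the GPU
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Forward pass without building a gradient graph
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Keep only the [CLS] token's hidden state, moved back to the CPU
    return {"hidden_state": last_hidden_state[:, 0].cpu().numpy()}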

To Reproduce

Steps to reproduce the behavior:

  1. Run 02_classification.ipynb on Kaggle with GPU usage selected.

Stack trace (partially):

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_34/3668832236.py in <module>
      1 #hide_output
----> 2 emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

/opt/conda/lib/python3.7/site-packages/datasets/dataset_dict.py in map(self, function, with_indices, input_columns, batched, batch_size, remove_columns, keep_in_memory, load_from_cache_file, cache_file_names, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, desc)
    502                     desc=desc,
    503                 )
--> 504                 for k, dataset in self.items()
    505             }
    506         )

...
...
...

/opt/conda/lib/python3.7/site-packages/torch/nn/modules/sparse.py in forward(self, input)
    158         return F.embedding(
    159             input, self.weight, self.padding_idx, self.max_norm,
--> 160             self.norm_type, self.scale_grad_by_freq, self.sparse)
    161 
    162     def extra_repr(self) -> str:

/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2041         # remove once script supports set_grad_enabled
   2042         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2043     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
   2044 
   2045 

RuntimeError: CUDA out of memory. Tried to allocate 256.00 MiB (GPU 0; 15.90 GiB total capacity; 512.05 MiB already allocated; 167.75 MiB free; 530.00 MiB reserved in total by PyTorch)

Complete stack trace: Stacktrace_RuntimeError_ch2_NLP_Transformers.txt

GPU (nvidia-smi):

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Tue Mar 15 13:44:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.04   Driver Version: 450.119.04   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   41C    P0    35W / 250W |  16113MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Expected behavior

As mentioned in the README.md, I would have expected the P100 with its 16 GB of GPU memory to be sufficient to run the code without issues. I also tried to free up some cache with torch.cuda.empty_cache(), but it did not suffice.

MarcusFra commented 2 years ago

This seems to be somewhat similar to #26 (in which the same error is mentioned for chapter 4).

EdwardJRoss commented 2 years ago

I'm running into the same issue. Reducing the batch size (from the default of 1000) solves it, and it doesn't seem to impact the processing time significantly; I'm finding 16 works:

emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True, batch_size=16)

But I'm not sure whether the training will still work.
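For what it's worth, the batch_size argument to Dataset.map only controls how many rows are fed to extract_hidden_states per call during feature extraction; the fine-tuning batch size is configured separately via TrainingArguments. A minimal sketch, with illustrative values rather than the chapter's exact settings:

from transformers import TrainingArguments

# The map() batch size above only affects feature extraction; this
# knob is what governs GPU memory during fine-tuning:
training_args = TrainingArguments(
    output_dir="distilbert-finetuned-emotion",  # illustrative name
    per_device_train_batch_size=64,
    num_train_epochs=2,
)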

EdwardJRoss commented 2 years ago

I've found that even after making this change on Kaggle, it still runs out of CUDA memory when it gets to fine-tuning. It seems to work fine if you do just the fine-tuning (e.g. in a separate example notebook on Kaggle).

I suspect the GPU memory needs to be cleared out after the "Extracting the Last Hidden States" section, but I can't work out how to do it (deleting all the objects and running torch.cuda.empty_cache() doesn't seem to solve it for me).
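For reference, this is roughly what I tried — a minimal sketch assuming model is the only GPU-resident object left in scope; it did not free enough memory for me:

import gc
import torch

del model                 # drop the Python reference to the GPU-resident model
gc.collect()              # make sure the object is actually collected
torch.cuda.empty_cache()  # release PyTorch's cached CUDA blocks back to the driver
# sanity check: both numbers should drop towards zero
print(torch.cuda.memory_allocated(), torch.cuda.memory_reserved())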

austinmw commented 2 years ago

Ran into the same issue.