stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

CUDA OutOfMemory Error with bulk_process() #1372

Closed by amckenny 5 months ago

amckenny commented 6 months ago

Describe the bug
I'm receiving a CUDA OutOfMemory error when running relatively small batches of IMDB reviews. Curiously, the torch CUDA OOM error is attempting only small allocations (e.g., ~212 MiB) when this happens. I'm running this on my university's HPC (Quartz) with a V100 reserved exclusively for my program, so it should be able to use up to 32 GB of GPU memory.

To Reproduce
Steps to reproduce the behavior:

  1. Obtain dataset from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
  2. Sample 3,000 texts from the dataset
  3. Spin up the preprocessing pipeline:
    nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos', tokenize_no_ssplit=True, use_gpu=True, pos_batch_size=1000)
  4. Attempt to preprocess text in batches:
    # Assumed context (not shown in the report): imports and inputs
    import numpy as np
    import torch
    from tqdm import tqdm
    # test_data: DataFrame with a 'review' column (the 3,000 sampled texts)
    # stops: a set of stopwords defined earlier

    # Update the pandas dataframe with the tokenized reviews
    preprocessed = []
    chunks = np.array_split(test_data['review'], len(test_data)//10)
    print("Preprocessing texts - this may take a while")
    for chunk in tqdm(chunks, desc="Chunks of 10 texts"):
        torch.cuda.empty_cache()
        results = nlp.bulk_process(chunk)
        for doc in results:
            preprocessed.append(
                [word.text.lower()
                 for sentence in doc.sentences
                 for word in sentence.words
                 if word.upos not in ["PUNCT", "SYM", "NUM", "X"]
                 and word.text.lower() not in stops
                ]
            )
    test_data['review_tokens'] = preprocessed
  5. See error
    Preprocessing texts - this may take a while
    Chunks of 10 texts:  20%|███▏      | 59/300 [00:40<02:45,  1.46it/s]
    Traceback (most recent call last):
    File "/N/slate/amckenny/class/ml.py", line 200, in <module>
    main()
    File "/N/slate/amckenny/class/ml.py", line 187, in main
    results = nlp.bulk_process(chunk)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 438, in bulk_process
    return self.process(docs, *args, **kwargs)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 427, in process
    doc = process(doc)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
    self.process(combined_doc) # annotations are attached to sentence objects
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py", line 85, in process
    preds += self.trainer.predict(b)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py", line 72, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
    File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/model.py", line 181, in forward
    all_backward_chars = self.charmodel_backward.build_char_representation(text)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 213, in build_char_representation
    output, _, _ = self.forward(chars, char_lens)
    File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 155, in forward
    decoded = self.decoder(output)
    File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
    File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
    torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 212.00 MiB (GPU 0; 31.74 GiB total capacity; 672.50 MiB already allocated; 81.12 MiB free; 948.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Expected behavior
I expected the code to complete and the preprocessed text to end up in a list for reincorporation into a pandas DataFrame.


AngledLuffa commented 6 months ago

It's very strange that it fails while trying to allocate only small amounts, yet still can't get through the dataset. I did just find out that some of the models have unreasonable default batch sizes. What happens if you try the flag pos_batch_size=100 when creating the Pipeline?
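
For reference, a minimal sketch of that change (the other arguments copied from the original report):

    import stanza

    # Same pipeline as before, with only the POS batch size lowered
    nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos',
                          tokenize_no_ssplit=True, use_gpu=True,
                          pos_batch_size=100)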

possible duplicate: https://github.com/stanfordnlp/stanza/issues/1370

amckenny commented 6 months ago

Replicated with pos_batch_size=100 - same outcome.

The interesting thing is that nvidia-smi confirms 0% memory usage before running the program and nearly 100% afterward. Is it possible that bulk_process doesn't release GPU memory as it iterates, so usage accumulates and never gets freed, resulting in death by a thousand cuts?
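
One way to check that hypothesis (a sketch; chunks and nlp as defined in the reproduction code above) is to log PyTorch's allocator statistics between chunks:

    import torch

    for i, chunk in enumerate(chunks):
        results = nlp.bulk_process(chunk)
        # A real leak would show the allocated bytes climbing steadily here
        # instead of returning to a stable baseline after each chunk.
        print(f"chunk {i}: "
              f"allocated={torch.cuda.memory_allocated() / 2**20:.0f} MiB, "
              f"reserved={torch.cuda.memory_reserved() / 2**20:.0f} MiB")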

Jupyter

Chunks of 10 texts:  23% | 69/300 [00:51<02:58,  1.29it/s]

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[5], line 86
     84 for chunk in tqdm(chunks, desc="Chunks of 10 texts"):
     85     torch.cuda.empty_cache()
---> 86     results = nlp.bulk_process(chunk)
     87     for doc in results:
     88         preprocessed.append(
     89             [word.text.lower()
     90              for sentence in doc.sentences
   (...)
     94             ]
     95         )

File ~/.local/lib/python3.10/site-packages/stanza/pipeline/core.py:438, in Pipeline.bulk_process(self, docs, *args, **kwargs)
    436 # Wrap each text as a Document unless it is already such a document
    437 docs = [doc if isinstance(doc, Document) else Document([], text=doc) for doc in docs]
--> 438 return self.process(docs, *args, **kwargs)

File ~/.local/lib/python3.10/site-packages/stanza/pipeline/core.py:427, in Pipeline.process(self, doc, processors)
    425     if self.processors.get(processor_name):
    426         process = self.processors[processor_name].bulk_process if bulk else self.processors[processor_name].process
--> 427         doc = process(doc)
    428 return doc

File ~/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py:258, in UDProcessor.bulk_process(self, docs)
    255 combined_doc.num_tokens = sum(doc.num_tokens for doc in docs)
    256 combined_doc.num_words = sum(doc.num_words for doc in docs)
--> 258 self.process(combined_doc) # annotations are attached to sentence objects
    260 return docs

File ~/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py:85, in POSProcessor.process(self, document)
     83         for i, b in enumerate(batch):
     84             idx.extend(b[-1])
---> 85             preds += self.trainer.predict(b)
     87 preds = unsort(preds, idx)
     88 dataset.doc.set([doc.UPOS, doc.XPOS, doc.FEATS], [y for x in preds for y in x])

File ~/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py:72, in Trainer.predict(self, batch, unsort)
     70 self.model.eval()
     71 batch_size = word.size(0)
---> 72 _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
     73 upos_seqs = [self.vocab['upos'].unmap(sent) for sent in preds[0].tolist()]
     74 xpos_seqs = [self.vocab['xpos'].unmap(sent) for sent in preds[1].tolist()]

File /N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File ~/.local/lib/python3.10/site-packages/stanza/models/pos/model.py:181, in Tagger.forward(self, word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
    178     all_forward_chars = [self.charmodel_forward_transform(x) for x in all_forward_chars]
    179 all_forward_chars = pack(pad_sequence(all_forward_chars, batch_first=True))
--> 181 all_backward_chars = self.charmodel_backward.build_char_representation(text)
    182 if self.charmodel_backward_transform is not None:
    183     all_backward_chars = [self.charmodel_backward_transform(x) for x in all_backward_chars]

File ~/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py:213, in CharacterLanguageModel.build_char_representation(self, sentences)
    210 chars = get_long_tensor(chars, len(all_data), pad_id=vocab.unit2id(CHARLM_END)).to(device=device)
    212 with torch.no_grad():
--> 213     output, _, _ = self.forward(chars, char_lens)
    214     res = [output[i, offsets] for i, offsets in enumerate(char_offsets)]
    215     res = unsort(res, orig_idx)

File ~/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py:155, in CharacterLanguageModel.forward(self, chars, charlens, hidden)
    153 output, hidden = self.charlstm(embs, charlens, hx=hidden)
    154 output = self.dropout(pad_packed_sequence(output, batch_first=True)[0])
--> 155 decoded = self.decoder(output)
    156 return output, hidden, decoded

File /N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
   1496 # If we don't have any hooks, we want to skip the rest of the logic in
   1497 # this function, and just call forward.
   1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1499         or _global_backward_pre_hooks or _global_backward_hooks
   1500         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501     return forward_call(*args, **kwargs)
   1502 # Do not call functions when jit is used
   1503 full_backward_hooks, non_full_backward_hooks = [], []

File /N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
    113 def forward(self, input: Tensor) -> Tensor:
--> 114     return F.linear(input, self.weight, self.bias)

OutOfMemoryError: CUDA out of memory. Tried to allocate 226.00 MiB (GPU 0; 31.74 GiB total capacity; 696.91 MiB already allocated; 33.12 MiB free; 994.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

.py file

Preprocessing texts - this may take a while
Chunks of 10 texts:   0%|                       | 1/300 [00:02<10:56,  2.20s/it]
Traceback (most recent call last):
  File "/N/slate/amckenny/class/ml.py", line 200, in <module>
    main()
  File "/N/slate/amckenny/class/ml.py", line 187, in main
    results = nlp.bulk_process(chunk)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 438, in bulk_process
    return self.process(docs, *args, **kwargs)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 427, in process
    doc = process(doc)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
    self.process(combined_doc) # annotations are attached to sentence objects
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py", line 85, in process
    preds += self.trainer.predict(b)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py", line 72, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
  File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/model.py", line 181, in forward
    all_backward_chars = self.charmodel_backward.build_char_representation(text)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 213, in build_char_representation
    output, _, _ = self.forward(chars, char_lens)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 155, in forward
    decoded = self.decoder(output)
  File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 204.00 MiB (GPU 0; 31.74 GiB total capacity; 659.87 MiB already allocated; 99.12 MiB free; 928.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

AngledLuffa commented 6 months ago

I spent some time verifying these things, and I think the POS processor only puts things on the GPU when it's about to process them. Indeed, the problem seems to lie in a specific document range rather than in running out of GPU memory over a long run. If I run with a batch size of 250, it dies when processing a specific file from the training set (aclImdb/train/pos/10044_9.txt), even if I only process that chunk. It works with a smaller batch size, though.

I don't think the tokenize_no_ssplit setting is correct. That file clearly contains a number of separate sentences. What happens is that the file gets tokenized into a single sentence of ~2000 words, and then everything batched together with it becomes too memory intensive.

Arguably the POS should be able to handle that gracefully... there are a few options I can think of which might help
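
On the caller's side, one stopgap in the meantime (a sketch; safe_bulk_process is a hypothetical helper, not Stanza API) is to catch the OOM and retry the offending chunk one document at a time:

    import torch

    def safe_bulk_process(nlp, chunk):
        """Bulk-process a chunk, falling back to one document at a time
        if the combined batch runs the GPU out of memory."""
        try:
            return nlp.bulk_process(chunk)
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()
            return [nlp(text) for text in chunk]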

amckenny commented 6 months ago

Good catch - I was actually playing around with that as well - when I run it with tokenize_no_ssplit=False I get the same error.

I just tried again with nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos', tokenize_no_ssplit=False, use_gpu=True, pos_batch_size=100) and get the following error:

Chunks of 10 texts:   1%|▎                      | 4/300 [00:03<03:50,  1.28it/s]
Traceback (most recent call last):
  File "/N/slate/amckenny/class/ml.py", line 200, in <module>
    main()
  File "/N/slate/amckenny/class/ml.py", line 187, in main
    results = nlp.bulk_process(chunk)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 438, in bulk_process
    return self.process(docs, *args, **kwargs)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 427, in process
    doc = process(doc)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
    self.process(combined_doc) # annotations are attached to sentence objects
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py", line 85, in process
    preds += self.trainer.predict(b)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py", line 72, in predict
    _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
  File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/model.py", line 175, in forward
    all_forward_chars = self.charmodel_forward.build_char_representation(text)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 213, in build_char_representation
    output, _, _ = self.forward(chars, char_lens)
  File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 155, in forward
    decoded = self.decoder(output)
  File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 274.00 MiB (GPU 0; 31.74 GiB total capacity; 787.23 MiB already allocated; 153.12 MiB free; 874.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

AngledLuffa commented 6 months ago

My experience has been that with the smaller batch size, it is able to free the memory used by the previous batches in time to allocate the next one. It's possible there are hardware or driver configurations where that doesn't work, though. When I read these files, I sorted them to make it easier to identify the file with the very long sentence... perhaps you could do something similar and see if there's another underlying issue? Meanwhile, I hope to add a feature today that is a little more cautious about processing overly long sentences together in giant batches.
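
Something along those lines (a sketch; the aclImdb/train/pos path is taken from the earlier comment) could be:

    import os

    # Rank the review files by word count; with tokenize_no_ssplit=True each
    # file becomes a single "sentence", so the longest files are the ones
    # most likely to blow up a POS batch.
    root = "aclImdb/train/pos"
    counts = []
    for name in os.listdir(root):
        with open(os.path.join(root, name), encoding="utf-8") as f:
            counts.append((len(f.read().split()), name))
    for n_words, name in sorted(counts, reverse=True)[:10]:
        print(n_words, name)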

amckenny commented 6 months ago

Thanks @AngledLuffa!

One of our HPC folks actually figured out what was going on. I was using TensorFlow earlier in the program, and apparently the garbage collector wasn't clearing the GPU memory before Torch tried to grab it.

Adding this code before the bulk_process call fixed the issue:

import gc
from numba import cuda

# Let Python collect any lingering TF objects first...
gc.collect()
# ...then reset the CUDA context on GPU 0 so the memory TF had reserved
# is actually returned to the device before Torch starts allocating.
cuda.select_device(0)
device = cuda.get_current_device()
device.reset()
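
A complementary measure (an assumption on my part, not verified in this thread: TF 2.x) is to stop TensorFlow from pre-allocating the whole GPU in the first place:

    import tensorflow as tf

    # By default TF reserves nearly all GPU memory at startup; enabling
    # memory growth makes it allocate only what it actually uses, leaving
    # headroom for Torch later in the same process. This must run before
    # any TF operation touches the GPU.
    for gpu in tf.config.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)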

Frustrating that TF and Torch can't learn to coexist - but hey - found a workaround!

I'm OK with closing this issue - but I don't know whether you'd rather leave it open to attach the feature you're working on.

Thanks again for your help!

AngledLuffa commented 5 months ago

This is now part of the 1.8.2 release.