Closed amckenny closed 5 months ago
It's very strange that it is only allocating small amounts, but it's failing to go through the dataset. I did just find out that some of the models have default batch sizes which are not reasonable. What happens if you try the flag pos_batch_size=100 when creating the Pipeline?
possible duplicate: https://github.com/stanfordnlp/stanza/issues/1370
Replicated with pos_batch_size=100 - same outcome.
The interesting thing is that nvidia-smi before running the program is confirmed at 0% memory usage and nearly 100% post. Is it possible that in bulk_process it doesn't release the GPU memory as it iterates such that the memory usage accumulates and never releases, resulting in death by a thousand cuts?
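One way to test that hypothesis is to log GPU memory at several points inside the program instead of only before and after the whole run. The following is a minimal sketch, not part of Stanza: `parse_gpu_memory` and `query_gpu_memory` are hypothetical helper names, and it assumes `nvidia-smi` is on the PATH of the compute node.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output.

    Returns one (used_mib, total_mib) pair per GPU line.
    Hypothetical helper for illustration; not a Stanza or PyTorch API.
    """
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for current per-GPU memory use, in MiB."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)
```

Calling `query_gpu_memory()` right after the imports, after any earlier model code, and once per chunk would show whether usage creeps up during `bulk_process` or is already high before the first chunk is processed.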
Jupyter
Chunks of 10 texts: 23% 69/300 [00:51<02:58, 1.29it/s]
---------------------------------------------------------------------------
OutOfMemoryError Traceback (most recent call last)
Cell In[5], line 86
84 for chunk in tqdm(chunks, desc="Chunks of 10 texts"):
85 torch.cuda.empty_cache()
---> 86 results = nlp.bulk_process(chunk)
87 for doc in results:
88 preprocessed.append(
89 [word.text.lower()
90 for sentence in doc.sentences
(...)
94 ]
95 )
File ~/.local/lib/python3.10/site-packages/stanza/pipeline/core.py:438, in Pipeline.bulk_process(self, docs, *args, **kwargs)
436 # Wrap each text as a Document unless it is already such a document
437 docs = [doc if isinstance(doc, Document) else Document([], text=doc) for doc in docs]
--> 438 return self.process(docs, *args, **kwargs)
File ~/.local/lib/python3.10/site-packages/stanza/pipeline/core.py:427, in Pipeline.process(self, doc, processors)
425 if self.processors.get(processor_name):
426 process = self.processors[processor_name].bulk_process if bulk else self.processors[processor_name].process
--> 427 doc = process(doc)
428 return doc
File ~/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py:258, in UDProcessor.bulk_process(self, docs)
255 combined_doc.num_tokens = sum(doc.num_tokens for doc in docs)
256 combined_doc.num_words = sum(doc.num_words for doc in docs)
--> 258 self.process(combined_doc) # annotations are attached to sentence objects
260 return docs
File ~/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py:85, in POSProcessor.process(self, document)
83 for i, b in enumerate(batch):
84 idx.extend(b[-1])
---> 85 preds += self.trainer.predict(b)
87 preds = unsort(preds, idx)
88 dataset.doc.set([doc.UPOS, doc.XPOS, doc.FEATS], [y for x in preds for y in x])
File ~/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py:72, in Trainer.predict(self, batch, unsort)
70 self.model.eval()
71 batch_size = word.size(0)
---> 72 _, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
73 upos_seqs = [self.vocab['upos'].unmap(sent) for sent in preds[0].tolist()]
74 xpos_seqs = [self.vocab['xpos'].unmap(sent) for sent in preds[1].tolist()]
File /N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.local/lib/python3.10/site-packages/stanza/models/pos/model.py:181, in Tagger.forward(self, word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
178 all_forward_chars = [self.charmodel_forward_transform(x) for x in all_forward_chars]
179 all_forward_chars = pack(pad_sequence(all_forward_chars, batch_first=True))
--> 181 all_backward_chars = self.charmodel_backward.build_char_representation(text)
182 if self.charmodel_backward_transform is not None:
183 all_backward_chars = [self.charmodel_backward_transform(x) for x in all_backward_chars]
File ~/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py:213, in CharacterLanguageModel.build_char_representation(self, sentences)
210 chars = get_long_tensor(chars, len(all_data), pad_id=vocab.unit2id(CHARLM_END)).to(device=device)
212 with torch.no_grad():
--> 213 output, _, _ = self.forward(chars, char_lens)
214 res = [output[i, offsets] for i, offsets in enumerate(char_offsets)]
215 res = unsort(res, orig_idx)
File ~/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py:155, in CharacterLanguageModel.forward(self, chars, charlens, hidden)
153 output, hidden = self.charlstm(embs, charlens, hx=hidden)
154 output = self.dropout(pad_packed_sequence(output, batch_first=True)[0])
--> 155 decoded = self.decoder(output)
156 return output, hidden, decoded
File /N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1501, in Module._call_impl(self, *args, **kwargs)
1496 # If we don't have any hooks, we want to skip the rest of the logic in
1497 # this function, and just call forward.
1498 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1499 or _global_backward_pre_hooks or _global_backward_hooks
1500 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1501 return forward_call(*args, **kwargs)
1502 # Do not call functions when jit is used
1503 full_backward_hooks, non_full_backward_hooks = [], []
File /N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)
OutOfMemoryError: CUDA out of memory. Tried to allocate 226.00 MiB (GPU 0; 31.74 GiB total capacity; 696.91 MiB already allocated; 33.12 MiB free; 994.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
.py file
Preprocessing texts - this may take a while
Chunks of 10 texts: 0%| | 1/300 [00:02<10:56, 2.20s/it]
Traceback (most recent call last):
File "/N/slate/amckenny/class/ml.py", line 200, in <module>
main()
File "/N/slate/amckenny/class/ml.py", line 187, in main
results = nlp.bulk_process(chunk)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 438, in bulk_process
return self.process(docs, *args, **kwargs)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 427, in process
doc = process(doc)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
self.process(combined_doc) # annotations are attached to sentence objects
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py", line 85, in process
preds += self.trainer.predict(b)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py", line 72, in predict
_, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/model.py", line 181, in forward
all_backward_chars = self.charmodel_backward.build_char_representation(text)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 213, in build_char_representation
output, _, _ = self.forward(chars, char_lens)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 155, in forward
decoded = self.decoder(output)
File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 204.00 MiB (GPU 0; 31.74 GiB total capacity; 659.87 MiB already allocated; 99.12 MiB free; 928.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I spent some time verifying these things, and I think the POS processor only puts data on the GPU when it's about to process it. Indeed, the problem seems to be somewhere in a specific document range, rather than running out of GPU memory over a long run. If I run with a batch size of 250, it dies when processing a specific file from the training set (aclImdb/train/pos/10044_9.txt), even if I only process that chunk. It works with a smaller batch size, though.
I don't think the tokenize_no_ssplit setting is correct. It definitely looks like that file has a bunch of separate sentences in it. What happens is the file gets tokenized into a single sentence of ~2000 words, then everything batched with it at once becomes too memory intensive.
Arguably the POS should be able to handle that gracefully... there are a few options I can think of which might help
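One possible shape for such graceful handling, offered only as a sketch and not as what Stanza does or will do, is to cap each batch by its padded size (number of sentences times the longest sentence in the batch) as well as by sentence count. A ~2000-word sentence then ends up in a small batch of its own. The name `batch_by_padded_size` and the default limits are assumptions chosen for illustration.

```python
def batch_by_padded_size(sentences, max_sentences=100, max_padded_tokens=5000):
    """Group token lists into batches bounded by count and by padded size.

    Padded size is len(batch) * longest sentence, which approximates the
    tensor a padded batch would occupy. A sentence that exceeds the limit
    on its own still gets a batch to itself, so nothing is dropped.
    Hypothetical helper for illustration; not Stanza's batching code.
    """
    batches, current, longest = [], [], 0
    for sent in sentences:
        new_longest = max(longest, len(sent))
        too_many = len(current) >= max_sentences
        too_big = bool(current) and (len(current) + 1) * new_longest > max_padded_tokens
        if too_many or too_big:
            # Close the current batch and start a fresh one with this sentence.
            batches.append(current)
            current, longest = [], 0
            new_longest = len(sent)
        current.append(sent)
        longest = new_longest
    if current:
        batches.append(current)
    return batches
```

With a 2000-token sentence followed by three 10-token sentences and `max_padded_tokens=3000`, the long sentence is isolated and the short ones share a batch.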
Good catch - I was actually playing around with that as well - when I run it with tokenize_no_ssplit=False I get the same error.
I just tried again with nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos', tokenize_no_ssplit=False, use_gpu=True, pos_batch_size=100) and get the following error:
Chunks of 10 texts: 1%|▎ | 4/300 [00:03<03:50, 1.28it/s]
Traceback (most recent call last):
File "/N/slate/amckenny/class/ml.py", line 200, in <module>
main()
File "/N/slate/amckenny/class/ml.py", line 187, in main
results = nlp.bulk_process(chunk)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 438, in bulk_process
return self.process(docs, *args, **kwargs)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/core.py", line 427, in process
doc = process(doc)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/processor.py", line 258, in bulk_process
self.process(combined_doc) # annotations are attached to sentence objects
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/pipeline/pos_processor.py", line 85, in process
preds += self.trainer.predict(b)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/trainer.py", line 72, in predict
_, preds = self.model(word, word_mask, wordchars, wordchars_mask, upos, xpos, ufeats, pretrained, word_orig_idx, sentlens, wordlens, text)
File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/pos/model.py", line 175, in forward
all_forward_chars = self.charmodel_forward.build_char_representation(text)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 213, in build_char_representation
output, _, _ = self.forward(chars, char_lens)
File "/N/u/amckenny/Quartz/.local/lib/python3.10/site-packages/stanza/models/common/char_model.py", line 155, in forward
decoded = self.decoder(output)
File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/N/soft/rhel8/deeplearning/Python-3.10.10/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 274.00 MiB (GPU 0; 31.74 GiB total capacity; 787.23 MiB already allocated; 153.12 MiB free; 874.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
My experience has been that with the smaller batch size, it is able to free up the memory used by the previous batch(es) in time to allocate the newer batch. It's possible there is hardware or a driver where that doesn't work, though. When I read these files, I sorted them to make it easier to identify the file in question with the very long sentence... perhaps you could do something similar and see if there's another underlying issue? Meanwhile, I hope that today I'll be able to add a feature where it is a little more cautious about processing overly long sentences together in giant batches.
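A rough way to do that kind of check, sketched under the assumption that whitespace-separated token counts are a good enough proxy for sentence length when sentence splitting is turned off. The helper names `load_sorted` and `rank_by_length` are hypothetical, not Stanza functions.

```python
from pathlib import Path


def load_sorted(directory):
    """Read every .txt file in `directory`, in sorted filename order."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in sorted(Path(directory).glob("*.txt"))}


def rank_by_length(named_texts, top=5):
    """Return the `top` (name, token_count) pairs with the most whitespace tokens."""
    counts = [(name, len(text.split())) for name, text in named_texts.items()]
    # Longest first; break ties by name so the report is deterministic.
    counts.sort(key=lambda item: (-item[1], item[0]))
    return counts[:top]
```

Running `rank_by_length(load_sorted(some_directory))` on the review folder would list the files most likely to produce an oversized batch, which can then be processed alone to confirm.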
Thanks @AngledLuffa!
One of our HPC folks actually figured out what was going on. I was using Tensorflow earlier in the program and apparently the garbage collector wasn't clearing up the GPU memory before Torch was trying to grab it.
Adding this code before the bulk_process method fixed the issue:
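import gc
from numba import cuda

# Drop unreachable Python objects (e.g. leftover TensorFlow objects) first.
gc.collect()
# Reset GPU 0 so memory held by the earlier framework is released before Torch allocates.
cuda.select_device(0)
device = cuda.get_current_device()
device.reset()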
Frustrating that TF and Torch can't learn to coexist - but hey - found a workaround!
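The numbers in the OOM messages above are consistent with that diagnosis: the total capacity minus the free memory minus what PyTorch had reserved leaves roughly 30 GiB held by something else. Below is a small sketch of that arithmetic. `memory_outside_pytorch` is a hypothetical helper, the regular expressions assume the message wording shown in this thread, and the result is only an estimate because it also includes CUDA context overhead.

```python
import re

# Conversion factors to MiB for the units that appear in the message.
_UNITS = {"MiB": 1.0, "GiB": 1024.0}


def memory_outside_pytorch(oom_message):
    """Estimate MiB of GPU memory not free and not reserved by PyTorch.

    Hypothetical helper for illustration; assumes the message format
    "... X GiB total capacity; ...; Y MiB free; Z MiB reserved in total by PyTorch".
    """
    def grab(label):
        match = re.search(r"([\d.]+) (MiB|GiB) " + label, oom_message)
        value, unit = match.groups()
        return float(value) * _UNITS[unit]

    total = grab("total capacity")
    free = grab("free")
    reserved = grab("reserved in total by PyTorch")
    return total - free - reserved


message = ("CUDA out of memory. Tried to allocate 226.00 MiB (GPU 0; "
           "31.74 GiB total capacity; 696.91 MiB already allocated; "
           "33.12 MiB free; 994.00 MiB reserved in total by PyTorch)")
held_elsewhere = memory_outside_pytorch(message)  # about 31474.64 MiB
```

A large value here, with only a small amount reserved by PyTorch, suggests looking at what else in the process or on the node is holding the GPU before tuning batch sizes.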
I'm OK with closing this issue - but don't know if you want me to leave it open to attach the feature you're working on to.
Thanks again for your help!
This is now part of the 1.8.2 release
Describe the bug
I'm receiving a CUDA OutOfMemory error when running relatively small batches of IMDB reviews. Curiously, the torch CUDA OOM error seems to occur while trying to allocate very small (e.g., 216 MB) reservations. I'm running this on my university's HPC cluster (Quartz) with a V100 reserved for just my program, so it should be able to reserve up to 32 GB of GPU memory.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected the code to complete and the preprocessed text to be in a list for reincorporation into a Pandas DataFrame.
Environment (please complete the following information):
Additional context
torch.cuda.empty_cache() was added - it didn't change the outcome; this error occurs with or without that line
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:734"
nvidia-smi before/after running in Jupyter: