ipipan / spacy-pl

GNU General Public License v3.0
49 stars 7 forks

Feeding long(ish) documents leads to large memory allocation #8

Open kuchenrolle opened 4 years ago

kuchenrolle commented 4 years ago

I'm trying to use your newest model "pl_spacy_model_morfeusz_big" to parse some documents, and I run into a memory error when the documents grow too large. One document is about 4,000 words long, and this is the traceback:


```
Traceback (most recent call last):
  File "scripts/annotate.py", line 51, in <module>
    annotate_corpus(infile=CORPUS, outfile=ANNOTATED, drop_tags={})
  File "scripts/annotate.py", line 28, in annotate_corpus
    parsed = parser(" ".join(document))
  File "/home/kuchenrolle/miniconda3/envs/ndl/lib/python3.7/site-packages/spacy/language.py", line 430, in __call__
    doc = self.make_doc(text)
  File "/home/kuchenrolle/miniconda3/envs/ndl/lib/python3.7/site-packages/spacy/language.py", line 454, in make_doc
    return self.tokenizer(text)
  File "/home/kuchenrolle/miniconda3/envs/ndl/lib/python3.7/site-packages/pl_spacy_model_morfeusz_big/preprocessor/__init__.py", line 234, in __call__
    return self.process(text)
  File "/home/kuchenrolle/miniconda3/envs/ndl/lib/python3.7/site-packages/pl_spacy_model_morfeusz_big/preprocessor/__init__.py", line 203, in process
    tags = self.toygger.process(non_white_analysis, doc)
  File "/home/kuchenrolle/miniconda3/envs/ndl/lib/python3.7/site-packages/pl_spacy_model_morfeusz_big/preprocessor/Toygger/__init__.py", line 85, in process
    X_s[5] = zeros((len(data), MAX_WORDS, self.settings.WORD2VEC_DIM))
MemoryError: Unable to allocate 38.1 GiB for an array with shape (4126, 4126, 300) and data type float64
```

I can't really look into this more right now, but at first glance it seems to be trying to allocate an array that is quadratic in the number of tokens (times the embedding dimension). If I split the document into chunks and run the parser on each chunk separately, it runs through fine.
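For what it's worth, the reported figure matches exactly what a float64 array of that shape requires, which supports the quadratic-scaling reading (the variable names here are just for the arithmetic, not from the model code):

```python
# The failing allocation has shape (n_tokens, n_tokens, embedding_dim)
# in float64, so memory grows quadratically with document length.
n_tokens = 4126
embedding_dim = 300
bytes_needed = n_tokens * n_tokens * embedding_dim * 8  # float64 = 8 bytes
gib = bytes_needed / 2**30
print(f"{gib:.1f} GiB")  # matches the ~38 GiB in the MemoryError
```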

ryszardtuora commented 4 years ago

Thank you for pointing this out. It appears to be a simple error in the tagger; I will fix this as soon as possible.

ryszardtuora commented 4 years ago

We're experimenting with some new features for the tagger, and most likely the fix will only be included with those, in the new version.

If you want a quick fix, you can modify one line in the model files: go to the directory where your Python modules are installed, and then to `pl_spacy_model_morfeusz_big/preprocessor/Toygger/__init__.py`.

In line 85 you will see:

```python
X_s[5] = zeros((len(data), MAX_WORDS, self.settings.WORD2VEC_DIM))
```

Please change this to:

```python
X_s[5] = zeros((1, MAX_WORDS, self.settings.WORD2VEC_DIM))
```

i.e. substitute `1` for `len(data)`. This solved the issue for me.
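Alternatively, if you'd rather not patch the installed files, you can keep splitting the document into chunks as you described. A minimal sketch of that workaround (the `chunk_words` helper and the chunk size of 500 are illustrative choices, not part of the model):

```python
def chunk_words(words, chunk_size=500):
    """Split a list of words into sublists of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# Feed each chunk to the pipeline separately instead of the whole document:
# for chunk in chunk_words(document):
#     parsed = parser(" ".join(chunk))
```

Since the allocation is quadratic in the token count, keeping each chunk a few hundred words long keeps the per-call memory modest.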