Open kuchenrolle opened 4 years ago

I'm trying to use your newest model pl_spacy_model_morfeusz_big to parse some documents, and I run into a memory error when the documents grow too large. One document is about 4000 words long, and this is the traceback:

I can't really look into this more right now, but at first glance it seems to be trying to allocate an array that is square in the number of tokens (times the embedding size). If I split the document into chunks and run the parser on each chunk separately, it runs through fine.
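As an illustration of the workaround, here is a minimal sketch of chunked parsing, assuming the model loads as a regular spaCy pipeline; the 500-word chunk size and the whitespace split are illustrative choices, not part of the original report:

```python
import spacy

nlp = spacy.load("pl_spacy_model_morfeusz_big")

def parse_in_chunks(text, chunk_size=500):
    """Split a long text into fixed-size word chunks and parse each
    chunk separately, so the tagger never sees the full document."""
    words = text.split()
    for start in range(0, len(words), chunk_size):
        chunk = " ".join(words[start:start + chunk_size])
        yield nlp(chunk)

# docs = list(parse_in_chunks(long_text))
```

Note that a plain whitespace split can cut a sentence in half at a chunk boundary, so parses near the boundaries may differ slightly from parsing the whole document at once.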
Thank you for pointing this out. It appears to be a simple error in the tagger; I will fix it as soon as possible.
We're experimenting with some new features for the tagger, and the fix will most likely only be included together with those, in the new version.
If you want a quick fix, you can modify one line in the model files:
go to the directory where your Python packages are installed, then open pl_spacy_model_morfeusz_big/preprocessor/Toygger/__init__.py
in line 85 you will see:
X_s[5] = zeros((len(data), MAX_WORDS, self.settings.WORD2VEC_DIM))
Please change this to:
X_s[5] = zeros((1, MAX_WORDS, self.settings.WORD2VEC_DIM))
i.e., substitute 1 for len(data).
This solved the issue for me.
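For context on why this helps, here is a rough back-of-the-envelope sketch of the allocation; the values of MAX_WORDS and WORD2VEC_DIM, and the assumption that data holds one entry per token, are guesses for illustration, since the surrounding Toygger code isn't shown here:

```python
n_tokens = 4000       # roughly the document size from the report
MAX_WORDS = n_tokens  # assumption: the padded length tracks the document length
WORD2VEC_DIM = 100    # assumption: a typical embedding dimension
BYTES = 8             # numpy's zeros() defaults to float64

# Original line: zeros((len(data), MAX_WORDS, WORD2VEC_DIM))
# -> one (MAX_WORDS x WORD2VEC_DIM) slice per element of data,
#    i.e. memory quadratic in the document length.
full = n_tokens * MAX_WORDS * WORD2VEC_DIM * BYTES
print(f"zeros((len(data), ...)): {full / 1e9:.1f} GB")     # 12.8 GB

# Patched line: zeros((1, MAX_WORDS, WORD2VEC_DIM))
# -> a single slice, linear in the document length.
patched = 1 * MAX_WORDS * WORD2VEC_DIM * BYTES
print(f"zeros((1, ...)):         {patched / 1e6:.1f} MB")  # 3.2 MB
```

Under these assumptions the original allocation grows quadratically with document length, which matches the observation above that splitting the document into chunks avoids the error.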