phHartl / eu-judgement-analyse

Quantitative analysis of judgments of the European Court of Justice
MIT License

Initialising CorpusAnalysis pipeline allocates too much memory #46

Closed: thomfischer closed this issue 3 years ago

thomfischer commented 3 years ago

When calling CorpusAnalysis.init_pipeline() on the entire corpus, the line self.corpus = textacy.Corpus(self.nlp, data=texts) allocates 12 GB of memory within 1 minute. This needs to be fixed if at all possible.

phHartl commented 3 years ago

There seems to be a known memory leak in spaCy 2.1.8 (https://github.com/explosion/spaCy/issues/3618), which was only fixed in v2.1.9.

phHartl commented 3 years ago

I will try to update blackstone manually to spaCy 2.2 to get multi-core pipeline support and the memory leak fixes (see #50).

phHartl commented 3 years ago

To reduce the memory footprint significantly, we additionally need to know which pipeline components are actually needed, so that we can disable the unnecessary ones (e.g. NER when no entity recognition is wanted; see #51).
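A minimal sketch of how that selection could look, assuming each analysis declares the components it needs. The helper name `components_to_disable` and the example component list are hypothetical and not taken from this repository; the result is the kind of list that spaCy's `spacy.load(..., disable=[...])` accepts.

```python
from typing import Iterable, List


def components_to_disable(available: Iterable[str], needed: Iterable[str]) -> List[str]:
    """Return the components that are available but not needed.

    Hypothetical helper; the order of `available` is preserved.
    """
    needed_set = set(needed)
    return [name for name in available if name not in needed_set]


# Component names as found in a typical spaCy 2.x model (assumed for illustration).
available = ["tagger", "parser", "ner"]
disable = components_to_disable(available, needed=["tagger", "parser"])
print(disable)

# With spaCy installed, the list could then be passed on, for example:
#   nlp = spacy.load("en_core_web_sm", disable=disable)
# The model name above is only an example, not the model used in this project.
```

Computing the list from what is needed, rather than hard-coding it, would keep the pipeline minimal as analyses are added or removed.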

phHartl commented 3 years ago

Updated requirements to spaCy 2.1.9 (56e3e36), which fixes the memory leak present in spaCy 2.1.8.
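To guard against an older installation slipping back in, a start-up check along these lines might help. This is only a sketch: `parse_version` and `has_leak_fix` are hypothetical helpers, not part of this repository or of spaCy, and the parsing deliberately handles only plain dotted version strings.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '2.1.9' into (2, 1, 9), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def has_leak_fix(version: str, minimum: Tuple[int, ...] = (2, 1, 9)) -> bool:
    """Check whether a version is at least the release that fixed the leak."""
    return parse_version(version) >= minimum


# With spaCy installed, the check could be applied to spacy.__version__:
#   if not has_leak_fix(spacy.__version__):
#       raise RuntimeError("spaCy >= 2.1.9 is required (memory leak in 2.1.8)")
```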