summanlp / textrank

TextRank implementation for Python 3.
https://pypi.org/project/summa/
MIT License
1.25k stars 261 forks source link

memory leak #18

Open SolskGaer opened 6 years ago

SolskGaer commented 6 years ago

when passing a long text, like 20MB long, the script will soon run out of the computer's memory. My machine has 16GB of RAM, it takes about 5 minutes to freeze.

romanovzky commented 6 years ago

I notice this as well, huge memory leak that does not improve by invoking python's garbage collector

fbarrios commented 6 years ago

Hi! I will look this up, but the TextRank algorithm is actually designed to summarise articles. I don't know if results with 20MB input can be trusted.

romanovzky commented 6 years ago

Hi @fbarrios , I've noticed the memory leak if you attempt to summarise/extract keywords from many small documents. It seems that it might be retaining in memory things that it shouldn't (previous graphs maybe?). Attempt it yourself, get an extensive corpus of small articles (like news articles) and extract keywords from all of them, and you'll very quickly see the used memory jumping to +16GB. Cheers

Panaetius commented 5 years ago

Profiling with cProfile, when the issue occurs for me, execution seems to be stuck in graph.py:159:edge_weight(), with tons of calls. The relevant edge_weight call is in pagerank_weighted.py:50:build_adjacency_matrix().

My guess is that it is due to there being too many sentences and/or too many connections between sentences, and since the adjacency matrix runtime is O(n^2) relative to number of sentences, this can blow up if there's lots of sentences.

A quick, dirty fix is to limit the number of sentences to some maximum in the keywords() method, though of course then you won't be analyzing the whole text anymore.

Switching over to a sparse adjacency matrix and estimating likely candidates, as described on https://cran.r-project.org/web/packages/textrank/vignettes/textrank.html (In the MinHash section) might make it feasible for larger documents. But I don't know your codebase or the textrank algorithm well enough to implement it myself easily.