lil-lab / newsroom

Tools for downloading and analyzing summaries and evaluating summarization systems. https://summari.es/

Memory keeps going up when scraping data #13

Closed anhduc114 closed 3 years ago

anhduc114 commented 5 years ago

As I run the scraping code, my computer's memory usage keeps climbing while downloading summaries until it crashes because there isn't enough memory. Is there a way to fix this? Running on Windows 7 64-bit, Python 3.6.

QiuJun1994 commented 5 years ago

I have encountered the same problem using an Amazon AWS instance, which has 4 GB of memory.

grusky commented 5 years ago

Thanks for submitting the issue! I've run into this problem as well in the past. The code in the Newsroom repo does not retain any data while downloading (it dumps it fairly quickly to the file), and I suspect this is a memory leak in one of the libraries being used — I have not figured out which one yet.
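If anyone wants to confirm where the growth happens, one option is to watch the scraper process's resident memory over time while it runs. A minimal sketch using `psutil` (my choice here, not part of the Newsroom tooling; the PID is a placeholder you'd replace with the running scraper's):

```python
import time
import psutil

# Placeholder: replace with the scraper's actual PID
# (find it with `ps`, `top`, or Task Manager).
pid = 12345
proc = psutil.Process(pid)

# Print resident memory every 30 seconds; a steady upward
# trend with no plateau is consistent with a leak.
while proc.is_running():
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print(f"RSS: {rss_mb:.1f} MB")
    time.sleep(30)
```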

The script saves progress during downloading, so you can always stop and restart the script while downloading -- this is not very convenient, I know, but it's a temporary solution until I post a fix!
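Since progress is saved, one stopgap is to run the scraper under a small wrapper that restarts it automatically whenever it dies (for example, when it is killed for running out of memory). A minimal sketch, assuming the scrape is launched from a command line; the command and arguments below are placeholders, so substitute whatever invocation you actually use:

```python
import subprocess
import time

# Placeholder command: replace with the exact scraper invocation you run.
COMMAND = ["python", "run_scrape.py"]

while True:
    # Each run resumes from the progress saved by the previous one.
    result = subprocess.run(COMMAND)
    if result.returncode == 0:
        print("Scrape finished cleanly.")
        break
    print(f"Scraper exited with code {result.returncode}; restarting in 10s...")
    time.sleep(10)
```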

(If you email newsroom@summari.es, I can also share a full copy of the data!)

QiuJun1994 commented 5 years ago

Thank you for answering my question. I downloaded the data successfully after stopping and restarting the script five times, so I think it is a memory leak. My AWS server has 4 GB of RAM, and every time I had downloaded about 20%, the program couldn't download any more, and I couldn't fork any process because there wasn't enough memory. So I killed the script and restarted it.

anhduc114 commented 5 years ago

Thanks for the answer. It seems that the issue only occurs on computers with little RAM. I tried the script on an 8 GB machine and the memory consumption stabilizes between 2.7 GB and 3 GB. Anyway, is there a link where I can download the extracted data? Going through the process takes too much time for me :(

grusky commented 5 years ago

Thanks, that's good to know. Though 3 GB is still much more memory than it needs to be using for the scraping — I'm looking into this. Send me an email (newsroom@summari.es) and I can share more data with you.