Closed: anhduc114 closed this issue 3 years ago
I have encountered the same problem using Amazon AWS, which has 4 GB of memory.
Thanks for submitting the issue! I've run into this problem as well in the past. The code in the Newsroom repo does not retain any data while downloading (it dumps it fairly quickly to the file), and I suspect this is a memory leak in one of the libraries being used — I have not figured out which one yet.
The script saves progress during downloading, so you can always stop and restart the script while downloading -- this is not very convenient, I know, but it's a temporary solution until I post a fix!
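Until a fix lands, one way to automate the stop/restart workaround is a small wrapper that reruns the download command whenever it exits and relies on the script's own saved progress to resume. This is only a minimal sketch: the `newsroom-scrape` invocation and its flags are assumptions and should be replaced with whatever command you actually run.

```python
# Hypothetical restart wrapper: rerun the downloader whenever it dies
# (e.g., killed for running out of memory) and let it resume from the
# progress the script already saves. Command name and flags are assumptions.
import subprocess
import sys
import time

CMD = ["newsroom-scrape", "--thin", "thin.jsonl.gz", "--archive", "archive.jsonl.gz"]

while True:
    result = subprocess.run(CMD)
    if result.returncode == 0:
        print("Download finished.")
        break
    # Non-zero exit code: wait briefly, then restart and resume.
    print(f"Downloader exited with code {result.returncode}; restarting...",
          file=sys.stderr)
    time.sleep(10)
```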
(If you email newsroom@summari.es, I can also share a full copy of the data!)
Thank you for answering my question. I downloaded the data successfully after stopping and restarting the script five times, so I think it is a memory leak. My AWS server has 4 GB of RAM, and every time I had downloaded about 20%, the program could not download any further, and I could not fork any new process because there was not enough memory, so I killed it and restarted it.
Thanks for the answer. It seems that the issue only occurs on computers with little RAM. I tried the script on an 8 GB machine and the memory consumption stabilizes between 2.7 GB and 3 GB. Anyway, is there a link where I can download the extracted data? Going through the whole process takes too much time for me :(
Thanks, that's good to know. Though 3 GB is still much more memory than it needs to be using for the scraping — I'm looking into this. Send me an email (newsroom@summari.es) and I can share more data with you.
As I run the scraping code, my computer's memory usage keeps growing while downloading summaries until it crashes because there is not enough memory. Is there a way to fix this? Running on Windows 7 64-bit, Python 3.6.
![capture](https://user-images.githubusercontent.com/22145938/48964898-691eac80-efe4-11e8-9247-76fedb46ca30.PNG)
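One simple way to confirm the steady growth is to log the downloader's resident memory over time. Here is a minimal sketch using the third-party `psutil` package (`pip install psutil`); the PID is a placeholder for the running download process.

```python
# Log the resident memory (RSS) of the download process every few seconds
# to verify that it keeps growing. The PID below is a placeholder.
import time
import psutil

pid = 12345  # placeholder: PID of the running download process
proc = psutil.Process(pid)

while proc.is_running():
    rss_mb = proc.memory_info().rss / (1024 * 1024)
    print(f"RSS: {rss_mb:.1f} MB")
    time.sleep(5)
```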