softwaresaved / rse-repo-analysis

Study of research software in repositories. Contact: @karacolada
BSD 3-Clause "New" or "Revised" License

Fix out of memory #20

Closed karacolada closed 1 year ago

karacolada commented 1 year ago

Process in VM runs out of memory on ray.yorksj.ac.uk.

karacolada commented 1 year ago

Tested on ray.yorksj.ac.uk. The process is always killed at the same link. Monitoring memory during the first half hour of execution showed that normal memory consumption is about 15%, so not too high. It also doesn't change much during execution: no spikes or longer-term increases. Will isolate the script on the PDF being processed when the error occurs.

karacolada commented 1 year ago

The problem was a large file (a PDF with embedded video, larger than 1 GB) that was too big to be held in memory. Fixed by setting stream=True when downloading the file (so the body is not downloaded immediately) and checking that the content size is smaller than 0.5 GB before starting to parse (and thus download). I.e. PDFs larger than 0.5 GB are no longer processed.
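
For anyone hitting the same issue, here is a minimal sketch of that kind of check, not the actual code from this repository. It assumes the download goes through the `requests` library and that the size is read from the `Content-Length` header. The names `MAX_PDF_BYTES`, `is_small_enough` and `fetch_pdf` are made up for illustration, as is the choice to skip files whose size is not reported.

```python
# Hypothetical limit: "0.5 GB" read here as 500 MiB.
MAX_PDF_BYTES = 500 * 1024 * 1024


def is_small_enough(headers, max_bytes=MAX_PDF_BYTES):
    """Return True only if Content-Length is present, numeric and below max_bytes."""
    length = headers.get("Content-Length")
    if length is None:
        # Assumption: a file of unknown size is skipped rather than risked.
        return False
    try:
        return int(length) < max_bytes
    except ValueError:
        # A malformed header is treated like a missing one.
        return False


def fetch_pdf(url, max_bytes=MAX_PDF_BYTES, timeout=30):
    """Return the PDF bytes, or None if the file is too large or of unknown size."""
    # Third-party dependency, imported here so the size check above
    # can be used and tested without requests installed.
    import requests

    # With stream=True only the headers are fetched at this point;
    # the body is downloaded when .content is accessed.
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        if not is_small_enough(response.headers, max_bytes):
            return None
        return response.content
```

A caller would skip the link when `fetch_pdf(url)` returns `None` and only hand the bytes to the PDF parser otherwise. Servers are not required to send `Content-Length`, so a stricter variant could instead read the body in chunks with `iter_content` and abort once the limit is exceeded.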