ropensci / biomartr

Genomic Data Retrieval with R
https://docs.ropensci.org/biomartr

High RAM usage when using meta.retrieval prevents usage on low end machines #40

Open · barrel0luck opened this issue 5 years ago

barrel0luck commented 5 years ago

This issue depends on the number of genome/CDS/RNA/etc. files to be downloaded. As the process goes on, the amount of RAM used by the R session progressively increases to the point where the entire system slows down. On a system with 8 GB RAM (a low-end machine), I've managed to download ~1600 files successfully, but I need to restart the system once the process is done because everything is extremely slow afterwards. So far I've not managed to download more than that, as the system becomes unresponsive. I suspect there is some variable that keeps growing as the process goes on and that could be cleaned up after each download to reduce RAM usage. Also note that the RAM usage intermittently decreases, i.e. it is not continuously increasing, but over a long period it grows a lot and eventually overwhelms the system.
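
To illustrate what I mean (just a rough sketch of the idea, not biomartr's actual internals; download_one_file() and organism_list are made-up placeholders), per-download cleanup inside the retrieval loop would look something like this:

for (organism in organism_list) {
  result <- download_one_file(organism)  # placeholder for a single retrieval step
  # ... write result to disk here ...
  rm(result)       # drop the per-iteration object
  invisible(gc())  # force garbage collection so intermediates don't pile up
}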

HajkD commented 5 years ago

Hi @barrel0luck,

Many thanks for contacting me and for making me aware of this issue.

Would you mind sharing a small example where this occurs? This will make my life much easier when troubleshooting.

Your help is very much appreciated.

Many thanks, Hajk

barrel0luck commented 5 years ago

Sure! And thanks for developing this awesome package! I hope you can keep maintaining it for a long time!

Here's the code you can use to reproduce the issue on a low-end system. This should download ~1600 files:

library(biomartr)  # provides meta.retrieval() and clean.retrieval()
library(magrittr)  # provides the %>% pipe
meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "refseq", type = "rna", reference = FALSE) %>%
  clean.retrieval()

This should download a much larger number of files (I'm not sure exactly how many, as I haven't managed to complete it so far):

meta.retrieval(kingdom = "bacteria", group = "Gammaproteobacteria", db = "genbank", type = "rna", reference = FALSE) %>%
  clean.retrieval()

HajkD commented 5 years ago

Perfect! Thank you so much :-)

I will have a look at it now.

Cheers, Hajk

barrel0luck commented 5 years ago

I should note that I'm using R on Fedora Linux. However, if the issue is in the code itself, e.g. one or more variables that grow with each iteration of a loop, then it should be reproducible on other OSes as well. I think the problem arises because the R session loads and keeps everything in RAM.
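
As a rough way to see this from the R side (just a sketch of what I mean by the session holding everything in RAM), one could compare the memory that base R's gc() reports before and after a retrieval batch:

before <- gc()  # gc() returns a matrix of memory statistics
# ... run one meta.retrieval() batch here ...
after <- gc()
# column 2 is the current usage in Mb (rows: Ncells / Vcells);
# steady growth across batches would point to objects accumulating in the session
print(after[, 2] - before[, 2])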