kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0
956 stars 174 forks source link

High memory consumption (50 GB) for genbank bacteria #120

Open openpaul opened 4 years ago

openpaul commented 4 years ago

I was using ncbi-genome-download to fetch all genbank bacteria (currently 630k). This seems to be a challenging task, I am unsure if this is even feasible this way. I noticed it will require up to 50GB of memory and wondered how that can be. I am currently debugging the tool to figure it out, but thought maybe it is known to someone. This would save me some trouble. I am launching it like this:

ncbi-genome-download -F fasta -s genbank     \
    --human-readable        \
    --retries 2     \
    --parallel 4    \
    --no-cache     \
    --verbose     \
    -o genomes     \
    bacteria
kblin commented 4 years ago

Hi, at what point is the memory usage up that high? I don't currently have the disk space or internet connection to really try downloading all of bacterial GenBank, but at least during the selection and MD5SUMS downloading step the memory usage is driven by your --parallel value and doesn't go above 2 GB for me with your input parameters.

Wrzlprmft commented 4 years ago

I observe memory issues as well. Here is what I did:

All of this screams memory leak to me.

Wrzlprmft commented 4 years ago

Running something similar again, I noticed that NCBI Genome Download appears to be spawning more Python processes, which each consume a bit of memory and amass over time. Their number exceeds the value of the parallel argument.

openpaul commented 4 years ago

Hi, so I actually finished downloading genbank bacteria. In the successfull run it actually used "only" 20 Gb of memory in 4 threads, so 5 Gb per thread, which is reasonable given the number of files. Still I assume this could be lowered. Here a plot from mprofile:

Screenshot_2020-05-18_11-13-39

Edit: I am not sure why it used only 20 Gb of memory this time where last time it requested more than that (50Gb)

I looked at the code and it seems a strange to me to first collect all jobs and then work them off, instead of launching them directly (in core.py::config_download). List download_jobs is created and used once in the next loop. Maybe thats the issue? But I guess a more thorough code profiling has to be done to really find where the memory stacks up.

Wrzlprmft commented 4 years ago

5 Gb per thread, which is reasonable given the number of files

Why would a downloading job require that much memory?

openpaul commented 4 years ago

Its over 600.000 files, all of which require multiple informations in memory. So per download candidate its only 20Gb/630.000, which is 31Kb, which itself seems reasonable. Or does it not? Still, as mentioned, I think the logic could be improved upon, reducing memory consumption. Right now I don't have the time to suggest and test alternatives, but maybe in a week or two I might be able to.

Wrzlprmft commented 4 years ago

So per download candidate its only 20Gb/630.000, which is 31Kb, which itself seems reasonable.

Not really. NCBI stores file information in assembly summaries, most of which are below 1 kB and even then, you only need a fraction of this for downloading. While these files can be much larger (Megabytes) for some species like E. coli, there is no reason to keep all of this information for all species in memory all the time.

kblin commented 4 years ago

ncbi-genome-download was built to allow intelligent filtering of files, not to just go grab all of them, and only the large file downloads were run in parallel, which is why there's this collect and execute split. If you don't use the tool to download all the things at once, the memory consumption is very reasonable, and that's what I've built it for. I'm happy to look at patches to restructure things, but it's not trivial and if you don't really need the filtering, a simpler script parsing the assembly_summary.txt file and then just downloading all those files might serve your needs better.

kblin commented 4 years ago

I mean I agree that the tool could be more careful with dumping things from memory again, but it was build for workflows like "get me all assemblies of genus Streptomyces that have an assembly level of 'complete' or 'chromosome' from GenBank", and not "download all 630000 bacterial genomes on the server".