kblin / ncbi-genome-download

Scripts to download genomes from the NCBI FTP servers
Apache License 2.0

.faa.gz files not being downloaded for bacteria #136

Open BhushanDhamale opened 4 years ago

BhushanDhamale commented 4 years ago

Hello. For the past week, I have been attempting to download protein FASTA files for all bacteria using the following command:

ncbi-genome-download -F 'protein-fasta' -p 5 -r 3 -v 'bacteria'

This creates the directory structure ./refseq/bacteria/GCF*, with each directory containing only the MD5SUMS file. Strangely enough, the same command run for other groups (archaea, fungi, plants, etc.) works fine and downloads the desired .faa.gz files. What am I missing here?

kblin commented 4 years ago

The main thing that comes to mind is that right now there are 127 plant assemblies, 333 fungal assemblies, 1,157 archaeal assemblies, and 200,357 bacterial assemblies in RefSeq. So there are massively more bacterial assemblies. The way ncbi-genome-download is currently built, we keep some info on the downloads in memory while running. Because ncbi-genome-download wasn't really designed as a tool to just download all the things, I wasn't super careful with runtime memory usage, so chances are that you're just running out of memory while downloading all bacteria.
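To put the counts quoted above in proportion, here is a small arithmetic sketch. It uses only the four numbers from this comment; the dictionary keys are illustrative labels, not necessarily the group names the tool accepts.

```python
# Assembly counts quoted in the comment above (RefSeq, at the time of writing).
assembly_counts = {
    "plant": 127,
    "fungi": 333,
    "archaea": 1_157,
    "bacteria": 200_357,
}

total = sum(assembly_counts.values())
bacteria_share = assembly_counts["bacteria"] / total
bacteria_vs_archaea = assembly_counts["bacteria"] / assembly_counts["archaea"]

print(f"total assemblies: {total}")
print(f"bacterial share of the four groups: {bacteria_share:.1%}")
print(f"bacteria vs. archaea: about {bacteria_vs_archaea:.0f}x")
```

Bacteria account for roughly 99% of these four groups combined, and about 173 times as many assemblies as archaea, the next largest group.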

Unaimend commented 1 year ago

I have the same problem on my side. Executing this command:

ncbi-genome-download bacteria --section refseq -l complete

results in directories that contain only the MD5SUMS file.
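One way to quantify the symptom is to list the assembly directories that hold nothing but a checksum file. The following is a minimal sketch, assuming the layout described in the original report (one subdirectory per assembly under something like refseq/bacteria, with the checksum file named MD5SUMS). The function name is hypothetical and not part of ncbi-genome-download.

```python
from pathlib import Path


def dirs_with_only_md5sums(root):
    """Return subdirectories of root whose only entry is a file named MD5SUMS."""
    root = Path(root)
    incomplete = []
    for directory in sorted(p for p in root.iterdir() if p.is_dir()):
        names = {entry.name for entry in directory.iterdir()}
        # A directory counts as incomplete only if MD5SUMS is its sole entry;
        # empty directories are not reported.
        if names == {"MD5SUMS"}:
            incomplete.append(directory)
    return incomplete
```

Calling dirs_with_only_md5sums("refseq/bacteria") while a download is still running shows how many assemblies are waiting for their sequence files. If that number shrinks over time, the run is progressing rather than failing.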

Unaimend commented 1 year ago

ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria results in the same problem.

kblin commented 1 year ago

Hm, I don't think you should be running out of memory on a restricted download set like this. So much for that theory. Could you run one of your download commands with the added --debug parameter and paste the last 10 lines or so of that run in here?
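For a long-running download, the tail of the log can be read while the process is still writing to it, as long as the output is redirected to a file. A minimal sketch follows; the function name and the log file path are placeholders.

```python
from collections import deque


def last_lines(path, count=10):
    """Return the last `count` lines of a text file as a list of strings."""
    with open(path, "r", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines,
        # so memory use stays small even for a very large log.
        return list(deque(handle, maxlen=count))
```

For example, print("".join(last_lines("log.log"))) prints the ten most recent lines without loading the whole file into memory.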

Unaimend commented 1 year ago

log.log

Executed command ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log

@kblin Is this error reproducible on your side? If not I can try to dig into the python code myself, it looks pretty clean.

I can't really post the last lines because the run takes quite a while, i.e. I have no idea how long my request would take.

Unaimend commented 1 year ago

ncbi-genome-download --genera "Vibrio fortis" --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log actually downloads "everything".

kblin commented 1 year ago

The full set of complete bacteria is a big download. I've just started a download with 12 parallel server connections, and extrapolating from the speed I'm getting the MD5SUMS files in it'll take around 20 minutes to just get those, before I can even get started on downloading the sequence files.
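Since the checksum files arrive before the sequence files, a directory's MD5SUMS file can be used to see which listed files are present yet. The sketch below is not the tool's internal logic. It assumes each line of MD5SUMS has the form "digest  ./filename" (the usual md5sum layout), and all function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_md5sums(text):
    """Map file name -> expected MD5 hex digest from 'digest  ./name' lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        name = name.strip()
        if name.startswith("./"):
            name = name[2:]
        expected[name] = digest.lower()
    return expected


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_directory(directory):
    """Return {file name: "ok" | "mismatch" | "missing"} for one directory."""
    directory = Path(directory)
    expected = parse_md5sums((directory / "MD5SUMS").read_text())
    result = {}
    for name, digest in expected.items():
        target = directory / name
        if not target.is_file():
            result[name] = "missing"
        elif md5_of(target) == digest:
            result[name] = "ok"
        else:
            result[name] = "mismatch"
    return result
```

Note that MD5SUMS may list every file of an assembly, not only the requested formats, so a "missing" entry is expected for formats that were never asked for.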

I just noticed that I didn't release the progress bar changes that tell you about the MD5SUMS download progress yet, I'll see what I can do about that.

Unaimend commented 1 year ago

Ahh ok, so I will first download all MD5SUMS? I did not know that. Is this documented somewhere? That might be the problem then.
Is it acceptable to just make that many requests? I wrote my own shitty version of a RefSeq downloader before I found your program (I just used wget) and ran into the problem that after a while my wget calls just stopped downloading stuff. I thought this was some kind of soft blocking because I made so many requests.
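Whether the server actually throttles or blocks clients is not established in this thread. If it does, a hand-written downloader would typically space out its retries rather than repeat a failed request immediately. The following is a generic sketch of retrying with exponential backoff, under that assumption. The function name and parameters are hypothetical and unrelated to ncbi-genome-download's own retry option.

```python
import time


def fetch_with_backoff(fetch, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on OSError wait base_delay * 2**attempt, then try again.

    `fetch` is any zero-argument callable that performs one request.
    After `retries` failed retries the last error is re-raised.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError as error:  # network failures are usually OSError subclasses
            last_error = error
            if attempt < retries:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

Passing `sleep` as a parameter makes it possible to test the function without real waiting. In practice `fetch` would wrap a single download call, for example one built on urllib.request.urlopen.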