Open BhushanDhamale opened 4 years ago
The main thing that comes to mind is that right now there are 127 plant assemblies, 333 fungal assemblies, 1,157 archaeal assemblies, and 200,357 bacterial assemblies in RefSeq. So there are massively more bacterial assemblies. The way ncbi-genome-download is built currently, we do keep some info on the downloads in memory while running. Because ncbi-genome-download wasn't really designed as a tool to just download all the things, I wasn't super careful with runtime memory usage, so chances are that you're just running out of memory while downloading all bacteria.
I have the same problem on my side
Executing this command
ncbi-genome-download bacteria --section refseq -l complete
results in directories that contain only the MD5SUMS file
ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria
results in the same problem
Hm, I don't think you should be running out of memory on a restricted download set like this. So much for that theory. Could you run one of your download commands with the added --debug parameter and paste the last 10 lines or so of that run in here?
Executed command
ncbi-genome-download --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log
@kblin Is this error reproducible on your side? If not, I can try to dig into the Python code myself; it looks pretty clean.
I can't really post the last lines because it takes quite a while, i.e. I have no idea how long my request would take.
ncbi-genome-download --genera "Vibrio fortis" --formats fasta,assembly-stats --assembly-levels complete bacteria --debug &> log.log
actually downloads "everything"
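Since the debug runs above redirect their output to log.log, the last lines of the log can probably be read while the download is still in progress, without waiting for it to finish. A minimal sketch, assuming a POSIX shell with tail available; the helper name last_lines is made up for illustration and is not part of ncbi-genome-download:

```shell
# Hypothetical helper (not part of ncbi-genome-download):
# print the last N lines of a log file, defaulting to 10.
last_lines() {
    tail -n "${2:-10}" "$1"
}
```

Calling `last_lines log.log` from a second terminal should print the 10 most recent log lines; `last_lines log.log 30` would print the last 30.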
The full set of complete bacteria is a big download. I've just started a download with 12 parallel server connections, and extrapolating from the speed I'm getting the MD5SUMS files in, it'll take around 20 minutes to just get those, before I can even get started on downloading the sequence files.
I just noticed that I didn't release the progress bar changes that tell you about the MD5SUMS download progress yet, I'll see what I can do about that.
Ahh ok, so I will first download all MD5SUMS? I did not know that. Is this documented somewhere? That might be the problem then.
Is it acceptable to just make that many requests? I wrote my own shitty version of a refseq downloader before I found your program (I just used wget) and ran into the problem that after a while my wget calls just stopped downloading stuff. I assumed this was some kind of soft blocking due to the fact that I made so many requests.
Hello. For the past week, I have been attempting to download protein fasta files for all bacteria using the following command:
ncbi-genome-download -F 'protein-fasta' -p 5 -r 3 -v 'bacteria'
This creates the directory structure as ./refseq/bacteria/GCF* containing only the MD5SUMS file in each directory. Strangely enough, the same command run for other groups (archaea, fungi, plants, etc.) runs just fine and downloads the desired .faa.gz files. What am I missing here?
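To tell an interrupted or still-running download apart from one that genuinely fetched nothing, it can help to list which assembly directories hold only the MD5SUMS file. A rough sketch, assuming the ./refseq/bacteria/GCF* layout described above and a find that supports -mindepth and -maxdepth (GNU and BSD find do); the function name only_md5sums is made up for illustration:

```shell
# Hypothetical helper: print every GCF* directory under the given base
# directory (default: refseq/bacteria) that contains an MD5SUMS file
# and nothing else.
only_md5sums() {
    base="${1:-refseq/bacteria}"
    for dir in "$base"/GCF*/; do
        # Skip the unexpanded glob pattern when nothing matches.
        [ -d "$dir" ] || continue
        [ -e "${dir}MD5SUMS" ] || continue
        # Look for any entry in the directory other than MD5SUMS.
        other=$(find "$dir" -mindepth 1 -maxdepth 1 ! -name MD5SUMS | head -n 1)
        if [ -z "$other" ]; then
            printf '%s\n' "${dir%/}"
        fi
    done
}
```

Running `only_md5sums | wc -l` a few minutes apart should show whether the number of MD5SUMS-only directories is shrinking, which would suggest the sequence files are still on their way.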