muellan / metacache

memory efficient, fast & precise taxonomic classification system for metagenomic read mapping
GNU General Public License v3.0

speedup refseq downloads - suggestion not issue #5

Closed by accopeland 6 years ago

accopeland commented 6 years ago

Downloading from NCBI with ascp is much faster than with wget. Changing your download script along these lines should speed it up considerably:

FTPURL="ftp://ftp.ncbi.nih.gov/genomes"
wget "$FTPURL/refseq/bacteria/assembly_summary.txt"
# complete genomes: column 12 = assembly level, column 11 = version status, column 20 = FTP directory
awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary.txt > ftpdirpaths
# append <assembly>_genomic.fna.gz to each directory, then strip the host so ascp gets host-relative paths
awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths | sed 's@ftp://ftp.ncbi.nlm.nih.gov@@' > ftpfilepaths
# ascp instead of wget
ascp -T -k2 -l 1000M -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh --user=anonftp --file-list=ftpfilepaths --mode=recv --host=ftp.ncbi.nlm.nih.gov .
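To make the path rewriting above easier to follow, here is a minimal sketch of the same transformation applied to a single directory URL. The function name `asm_file_path` is hypothetical, the sample URL is only an illustration, and the sketch uses `$NF` (the last path component) instead of the hard-coded `$10`, on the assumption that the assembly name is always the final directory.

```shell
#!/bin/sh
# Hypothetical helper (not part of metacache): turn one assembly directory URL,
# as found in column 20 of assembly_summary.txt, into the host-relative file
# path that ascp reads from its --file-list.
asm_file_path() {
    printf '%s\n' "$1" |
        # With OFS="/", "print $0, x" emits "<dir>/<x>".
        # Assumption: the last path component ($NF) is the assembly name.
        awk 'BEGIN { FS = OFS = "/"; suffix = "genomic.fna.gz" }
             { print $0, $NF "_" suffix }' |
        # Strip scheme and host so only the path on the server remains.
        sed 's@^ftp://ftp.ncbi.nlm.nih.gov@@'
}

# Illustrative input only.
asm_file_path "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/005/845/GCF_000005845.2_ASM584v2"
```

Each line written to ftpfilepaths has this shape: the directory path, followed by the assembly name with `_genomic.fna.gz` appended.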
muellan commented 6 years ago

Thanks for the suggestion. I know the genomes download is really slow, but wget is a GNU core utility and you can rely on it being present on any GNU/Linux platform. Maybe I should check in the scripts whether ascp is present and use it if it is available.
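A check like that could be sketched roughly as follows. This is only an illustration, not the actual metacache download script: the function names `pick_downloader` and `download_genomes` are hypothetical, the ascp options and key path are copied from the suggestion above, and the wget branch assumes the list file holds host-relative paths that need the host prepended again.

```shell
#!/bin/sh
# Hypothetical sketch: prefer ascp when it is on PATH, otherwise fall back to wget.
pick_downloader() {
    if command -v ascp >/dev/null 2>&1; then
        echo ascp
    else
        echo wget
    fi
}

# $1: file with host-relative paths, one per line (ftpfilepaths in this thread).
download_genomes() {
    list=$1
    case $(pick_downloader) in
        ascp)
            # Options and key location as given in the original suggestion.
            ascp -T -k2 -l 1000M \
                -i "$HOME/.aspera/connect/etc/asperaweb_id_dsa.openssh" \
                --user=anonftp --file-list="$list" --mode=recv \
                --host=ftp.ncbi.nlm.nih.gov .
            ;;
        wget)
            # wget needs full URLs, so put the host back in front of each path.
            sed 's@^@ftp://ftp.ncbi.nlm.nih.gov@' "$list" | xargs -n 1 wget -c -nv
            ;;
    esac
}
```

Calling `download_genomes ftpfilepaths` would then use whichever tool is installed, so ascp stays an optional speedup rather than a hard dependency.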

accopeland commented 6 years ago

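I certainly understand the desire to avoid messy dependencies. To keep within the GNU-verse you could consider either of the following (the host has to be prepended again, because ftpfilepaths holds host-relative paths):

sed 's@^@ftp://ftp.ncbi.nlm.nih.gov@' ftpfilepaths | xargs -P 16 -n 1 wget -P genomes -c -nv

or

sed 's@^@ftp://ftp.ncbi.nlm.nih.gov@' ftpfilepaths | parallel -j 16 "wget -P genomes -c -nv {}"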

muellan commented 6 years ago

You are right. The downloads should at least be done in parallel. I will change that.
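One way such a change might look is sketched below, assuming xargs supports the `-P` option (GNU and BSD xargs do; it is not in POSIX). The function name `fetch_all` and the `FETCH` override variable are hypothetical and exist only so the sketch can be tried without network access. The wget flags are the ones suggested earlier in this thread.

```shell
#!/bin/sh
# Hypothetical sketch of a parallel download step.
# $1: file with host-relative paths, one per line
# $2: number of parallel jobs (default 16, as in the suggestion above)
fetch_all() {
    list=$1
    jobs=${2:-16}
    # FETCH is left unquoted on purpose so that a multi-word command splits
    # into separate arguments; the default mirrors the thread's wget flags.
    if ! sed 's@^@ftp://ftp.ncbi.nlm.nih.gov@' "$list" |
            xargs -P "$jobs" -n 1 ${FETCH:-wget -P genomes -c -nv}
    then
        # xargs exits non-zero if any single invocation failed.
        echo "fetch_all: at least one download failed" >&2
        return 1
    fi
}
```

Running `FETCH=echo fetch_all ftpfilepaths 4` prints the URLs that would be fetched instead of downloading them, which is a cheap way to check the list before starting a long transfer.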