pirovc / genome_updater

Bash script to download/update snapshots of files from NCBI genomes repository (refseq/genbank) with track of changes and without redundancy
MIT License

Failed to download gtdb taxonomy #77

Closed ohickl closed 1 year ago

ohickl commented 1 year ago

Hi, I wanted to make use of this very convenient tool of yours to get all the GTDB genomes for https://github.com/pirovc/ganon/issues/227, but I almost immediately get:

-------------------------------------------
┌─┐┌─┐┌┐┌┌─┐┌┬┐┌─┐    ┬ ┬┌─┐┌┬┐┌─┐┌┬┐┌─┐┬─┐
│ ┬├┤ ││││ ││││├┤     │ │├─┘ ││├─┤ │ ├┤ ├┬┘
└─┘└─┘┘└┘└─┘┴ ┴└─┘────└─┘┴  ─┴┘┴ ┴ ┴ └─┘┴└─
                                     v0.5.1
-------------------------------------------
Mode: NEW
Args: -M 'gtdb' -d 'refseq,genbank' -f 'genomic.fna.gz,assembly_report.txt' -g 'archaea,bacteria' -o '.../gtdb_genomes' -t '60' -m -r
Outp: .../gtdb_genomes/
-------------------------------------
Downloading assembly summary [2023-01-12_15-21-16]
 - Database [refseq,genbank]
 - Organism group [archaea,bacteria]
 - 1797128 assembly entries available

Filtering assembly summary [2023-01-12_15-21-16]
 - Downloading taxonomy (gtdb)
 - Failed to download https://data.gtdb.ecogenomic.org/releases/release207/207.0/ar53_taxonomy_r207.tsv.gz. Trying again #2
 - Failed to download https://data.gtdb.ecogenomic.org/releases/release207/207.0/ar53_taxonomy_r207.tsv.gz. Trying again #3
 - Failed

I can download the file just fine manually (wget), and its md5sum matches.

Also tried just copying the example from the readme, getting the same result.

Best

Oskar

pirovc commented 1 year ago

If the url still exists but the script is failing to get it, it's probably a network error. Some things you can try:

ohickl commented 1 year ago

Tried all of them on different cluster nodes, and on a local macOS and a Linux system. Same results, unfortunately. It might be a network problem, but since I can download the file manually, I am not so sure.

pirovc commented 1 year ago

I found the issue: the repository's MD5SUM file, which is used to check the integrity of the download, now has a ".txt" extension. Adding that extension to the URL on the following line (so that it reads https://data.gtdb.ecogenomic.org/releases/release207/207.0/MD5SUM.txt) should solve the problem. I'll double-check which version is correct and release a fix soon.

https://github.com/pirovc/genome_updater/blob/a512fae629fb0732d1b9e9c4837968dcbcd23f3d/genome_updater.sh#L1482
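
To illustrate why a renamed checksum file can make a healthy download look like a failure, here is a minimal sketch of a checksum lookup. It is not the code genome_updater uses. The helper name verify_against_md5_list is hypothetical, and the sketch assumes the checksum list follows the usual md5sum layout (hash, whitespace, file name). If the list itself cannot be fetched, the check cannot pass even though the data file downloaded correctly.

```shell
#!/usr/bin/env bash
# Sketch only: verify_against_md5_list is a hypothetical helper, not part of
# genome_updater. Assumes the list uses the md5sum layout "<hash>  <path>".

verify_against_md5_list() {
    local file="$1" list="$2"
    local name expected actual
    name="$(basename "$file")"

    # Without a checksum list (for example, a 404 on the old URL) there is
    # nothing to compare against, so report that separately with status 2.
    [ -s "$list" ] || { echo "checksum list missing or empty: $list" >&2; return 2; }

    # Find the expected hash for this file name, ignoring any leading path.
    expected="$(awk -v n="$name" '{ f = $2; sub(/^.*\//, "", f); if (f == n) { print $1; exit } }' "$list")"
    [ -n "$expected" ] || { echo "no checksum entry for $name" >&2; return 2; }

    # Compare the hash of the downloaded file with the expected one.
    actual="$(md5sum "$file" | awk '{ print $1 }')"
    [ "$actual" = "$expected" ]
}
```

Called as verify_against_md5_list ar53_taxonomy_r207.tsv.gz MD5SUM.txt, it returns 0 on a match, 1 on a mismatch, and 2 when the list or the entry is missing, which keeps "bad download" and "could not check" apart.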

pirovc commented 1 year ago

If you installed via conda, you can find the script with whereis genome_updater.sh and run the following to fix it:

sed -i 's|https://data.gtdb.ecogenomic.org/releases/release207/207.0/MD5SUM|https://data.gtdb.ecogenomic.org/releases/release207/207.0/MD5SUM.txt|g' /path/to/your/genome_updater.sh
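
Note that the sed one-liner above is not idempotent: running it a second time would turn MD5SUM.txt into MD5SUM.txt.txt. A slightly more defensive variant could look like the sketch below. The function name patch_md5sum_url is hypothetical; the URLs are the ones from this thread. It uses -i.bak, which both GNU sed and the BSD sed shipped with macOS accept, and which leaves a backup next to the script.

```shell
#!/usr/bin/env bash
# Sketch only: patch_md5sum_url is a hypothetical helper name.
# Applies the MD5SUM -> MD5SUM.txt workaround at most once and keeps a backup.

patch_md5sum_url() {
    local script="$1"
    local old='https://data.gtdb.ecogenomic.org/releases/release207/207.0/MD5SUM'
    local new="${old}.txt"

    [ -f "$script" ] || { echo "not found: $script" >&2; return 1; }

    # Already patched: stop here so a second run cannot append ".txt" again.
    if grep -qF "$new" "$script"; then
        echo "already patched: $script"
        return 0
    fi

    # Old URL absent: probably a different version of the script.
    if ! grep -qF "$old" "$script"; then
        echo "old URL not found in $script" >&2
        return 1
    fi

    # Rewrite in place; the original is kept as "$script.bak".
    sed -i.bak "s|${old}|${new}|g" "$script"

    # Confirm that the new URL is now present.
    grep -qF "$new" "$script"
}
```

If the script is on your PATH (as it should be after a conda install), something like patch_md5sum_url "$(command -v genome_updater.sh)" would apply it. Upgrading to v0.5.2 makes the workaround unnecessary.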

ohickl commented 1 year ago

Works now, thanks!

pirovc commented 1 year ago

Fixed with v0.5.2 #78