metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License

Reuse existing genome taxonomy database (gtdb) when installing conda environments #680

Closed jotech closed 1 year ago

jotech commented 1 year ago

The local rule download_gtdb is executed when running atlas download, even though the file gtdb_data.tar.gz already exists in the folder specified by --db-dir. This causes unnecessary traffic and runtime because the GTDB is quite large.

Background: I removed the conda environments manually for debugging reasons and found that the download started again, although all necessary files (atlas/GTDB_V08_R214/gtdb_data.tar.gz) were available.

Describe the solution you'd like

The rule download_gtdb should check whether the file is available and reuse the existing download whenever possible.

Additional context

localrule download_gtdb:
    output: [...]/dat/db/atlas/GTDB_V08_R214/gtdb_data.tar.gz
    log: logs/download/gtdbtk.log
    jobid: 6
    reason: Missing output files: [...]/dat/db/atlas/GTDB_V08_R214/gtdb_data.tar.gz; Code has changed since last execution
    resources: tmpdir=/tmp, time=10

SilasK commented 1 year ago

The tar.gz is only an intermediate file.

In theory, once the tar is downloaded, a rule extract_gtdb should extract it and create a flag, os.path.join(GTDBTK_DATA_PATH, "downloaded_success"). From then on, no more data should be downloaded.
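The flag logic described here can be sketched roughly as follows. This is a minimal Python illustration, not the actual ATLAS rule: the helper names needs_download and mark_extracted are hypothetical, and the data path is passed in as a plain argument standing in for GTDBTK_DATA_PATH.

```python
import os


def needs_download(gtdbtk_data_path, flag_name="downloaded_success"):
    """Return True when the success flag is absent, i.e. the data must be fetched."""
    flag = os.path.join(gtdbtk_data_path, flag_name)
    return not os.path.exists(flag)


def mark_extracted(gtdbtk_data_path, flag_name="downloaded_success"):
    """Create the empty flag file after a successful extraction (hypothetical helper)."""
    os.makedirs(gtdbtk_data_path, exist_ok=True)
    flag = os.path.join(gtdbtk_data_path, flag_name)
    with open(flag, "w"):
        pass  # an empty file is enough; only its existence matters
    return flag
```

In this sketch, only the flag decides whether work is repeated, so the intermediate tar.gz could be deleted after extraction without triggering a new download.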

It is quite probable that the download was only halfway done and left an incomplete tar.gz, don't you think? In such cases I prefer to remove the file and restart the download.
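One way to tell an incomplete archive from a usable one before deciding to remove it is to stream through the whole gzip file. The following is only a sketch under assumptions: the function names are hypothetical, nothing like this is confirmed to exist in ATLAS, and reading a very large archive end to end takes time of its own.

```python
import gzip
import os
import zlib


def is_complete_gzip(path, chunk_size=1 << 20):
    """Return True if the file is non-empty and its gzip stream decompresses fully."""
    if os.path.getsize(path) == 0:
        return False  # an empty file is never a finished download
    try:
        with gzip.open(path, "rb") as fh:
            # Read in chunks so a large archive is never held in memory.
            while fh.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # Truncated or corrupt streams raise one of these while reading.
        return False
    return True


def remove_if_incomplete(path):
    """Delete the archive if it exists but is truncated or corrupt; return True if removed."""
    if os.path.exists(path) and not is_complete_gzip(path):
        os.remove(path)
        return True
    return False
```

A complete archive would be left in place and could be reused, while a truncated one is removed so that the download starts again cleanly.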

SilasK commented 1 year ago

But thank you for suggesting improvements.