rules to download genbank sourmash databases

The Snakefile workflow uses databases that are already on my local compute cluster, so I didn't need to download them as part of the workflow. I'm including rules to download them below.

To use these rules, the input paths for the sourmash gather rule in the Snakefile would need to change to match the output paths recorded below.

rule download_sourmash_db_genbank_bacteria:
    output: "inputs/sourmash_dbs/genbank-2022.03-bacteria-k31.zip"
    threads: 1
    resources:
        mem_mb = 800,
        time_min = 30
    shell:'''
    wget -O {output} https://dweb.link/ipfs/bafybeigkcvizvhe3xzxsuzv3ryf3ogvgvcmms2e5nfk7epl5egts22jyue
    '''

rule download_sourmash_db_genbank_archaea:
    output: "inputs/sourmash_dbs/genbank-2022.03-archaea-k31.zip"
    threads: 1
    resources:
        mem_mb = 800,
        time_min = 30
    shell:'''
    wget -O {output} https://dweb.link/ipfs/bafybeidn6epju7yrdxrktq5wjko2yiwp6nrx3mq37htiuwecm7lffrbcdi
    '''

rule download_sourmash_db_genbank_fungi:
    output: "inputs/sourmash_dbs/genbank-2022.03-fungi-k31.zip"
    threads: 1
    resources:
        mem_mb = 800,
        time_min = 30
    shell:'''
    wget -O {output} https://dweb.link/ipfs/bafybeidhhwvwujkteno5ugwgjy4brhrv5dff2aumifcuew73qolfktdndq
    '''

rule download_sourmash_db_genbank_protozoa:
    output: "inputs/sourmash_dbs/genbank-2022.03-protozoa-k31.zip"
    threads: 1
    resources:
        mem_mb = 800,
        time_min = 30
    shell:'''
    wget -O {output} https://dweb.link/ipfs/bafybeicpxjhfrzem7f34eghbbwm3vglz2njxo72vpqcw7foilfomexsghi
    '''

rule download_sourmash_db_genbank_viral:
    output: "inputs/sourmash_dbs/genbank-2022.03-viral-k31.zip"
    threads: 1
    resources:
        mem_mb = 800,
        time_min = 30
    shell:'''
    wget -O {output} https://dweb.link/ipfs/bafybeibqsldwsztjf66rwvwnb6hamjtsfkmdk5bmfqbzwrod6wwwkqz2ya
    '''
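
The five rules above differ only in the lineage name and the IPFS link, so they could be collapsed into a single wildcard rule. This is a minimal sketch, not part of the workflow: the `GENBANK_DB_URLS` dictionary, the `lineage` wildcard, and the rule name are hypothetical names introduced here. It would replace the five rules rather than sit alongside them, since both would otherwise claim the same output files.

```snakemake
# Hypothetical lookup table (not in the workflow); links copied from the rules above.
GENBANK_DB_URLS = {
    "bacteria": "https://dweb.link/ipfs/bafybeigkcvizvhe3xzxsuzv3ryf3ogvgvcmms2e5nfk7epl5egts22jyue",
    "archaea": "https://dweb.link/ipfs/bafybeidn6epju7yrdxrktq5wjko2yiwp6nrx3mq37htiuwecm7lffrbcdi",
    "fungi": "https://dweb.link/ipfs/bafybeidhhwvwujkteno5ugwgjy4brhrv5dff2aumifcuew73qolfktdndq",
    "protozoa": "https://dweb.link/ipfs/bafybeicpxjhfrzem7f34eghbbwm3vglz2njxo72vpqcw7foilfomexsghi",
    "viral": "https://dweb.link/ipfs/bafybeibqsldwsztjf66rwvwnb6hamjtsfkmdk5bmfqbzwrod6wwwkqz2ya",
}

rule download_sourmash_db_genbank:
    output: "inputs/sourmash_dbs/genbank-2022.03-{lineage}-k31.zip"
    # Restrict the wildcard to the lineages that have a link in the table.
    wildcard_constraints:
        lineage = "|".join(GENBANK_DB_URLS)
    # Look up the link for whichever lineage is being requested.
    params:
        url = lambda wildcards: GENBANK_DB_URLS[wildcards.lineage]
    threads: 1
    resources:
        mem_mb = 800,
        time_min = 30
    shell:'''
    wget -O {output} {params.url}
    '''
```

The trade-off is one rule to maintain instead of five, at the cost of the rule name no longer identifying the lineage in the logs.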

GenBank and GTDB rs207 produced the same gather results (e.g. both missed viruses, and GenBank found no additional species), so I'm leaving the workflow as is. I'm recording the GenBank gather rule here for future reference.

rule sourmash_gather_mgx_genbank:
    input:
        sig = "outputs/sourmash_sigs/{sample}.sig",
        db1 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-bacteria-k31.zip",
        db2 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-archaea-k31.zip",
        db3 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-fungi-k31.zip",
        db4 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-protozoa-k31.zip",
        db5 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-viral-k31.zip",
    output: "outputs/sourmash_gather/{sample}_k31_scaled2000_genbank.csv"
    conda: "envs/sourmash.yml"
    benchmark: "benchmarks/sourmash_gather_k31_scaled2000_genbank_{sample}.tsv"
    threads: 1
    resources:
        mem_mb = 64000,
        time_min = 480
    shell:'''
    sourmash gather -o {output} -k 31 {input.sig} {input.db1} {input.db2} {input.db3} {input.db4} {input.db5}
    '''
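
If the databases were downloaded with the rules in this PR instead of read from the cluster, the gather rule's inputs would point at the download rules' outputs. This is a sketch of that variant, assuming the download rules live in the same Snakefile. Only the five `db` paths differ from the rule above, and it would replace that rule rather than be added next to it, since the two share a name.

```snakemake
rule sourmash_gather_mgx_genbank:
    input:
        sig = "outputs/sourmash_sigs/{sample}.sig",
        # Paths below match the outputs of the download rules, so Snakemake
        # fetches any missing database before running gather.
        db1 = "inputs/sourmash_dbs/genbank-2022.03-bacteria-k31.zip",
        db2 = "inputs/sourmash_dbs/genbank-2022.03-archaea-k31.zip",
        db3 = "inputs/sourmash_dbs/genbank-2022.03-fungi-k31.zip",
        db4 = "inputs/sourmash_dbs/genbank-2022.03-protozoa-k31.zip",
        db5 = "inputs/sourmash_dbs/genbank-2022.03-viral-k31.zip",
    output: "outputs/sourmash_gather/{sample}_k31_scaled2000_genbank.csv"
    conda: "envs/sourmash.yml"
    benchmark: "benchmarks/sourmash_gather_k31_scaled2000_genbank_{sample}.tsv"
    threads: 1
    resources:
        mem_mb = 64000,
        time_min = 480
    shell:'''
    sourmash gather -o {output} -k 31 {input.sig} {input.db1} {input.db2} {input.db3} {input.db4} {input.db5}
    '''
```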
