Open taylorreiter opened 2 years ago
GenBank and GTDB rs207 produced the same gather results (e.g. both missed viruses, and GenBank added no additional species), so I'm leaving the workflow as is. Dumping the GenBank gather rule here for future reference:
rule sourmash_gather_mgx_genbank:
    input:
        sig = "outputs/sourmash_sigs/{sample}.sig",
        db1 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-bacteria-k31.zip",
        db2 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-archaea-k31.zip",
        db3 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-fungi-k31.zip",
        db4 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-protozoa-k31.zip",
        db5 = "/group/ctbrowngrp/sourmash-db/genbank-2022.03/genbank-2022.03-viral-k31.zip",
    output: "outputs/sourmash_gather/{sample}_k31_scaled2000_genbank.csv"
    conda: "envs/sourmash.yml"
    benchmark: "benchmarks/sourmash_gather_k31_scaled2000_genbank_{sample}.tsv"
    threads: 1
    resources:
        mem_mb = 64000,
        time_min = 480
    shell:'''
    sourmash gather -o {output} -k 31 {input.sig} {input.db1} {input.db2} {input.db3} {input.db4} {input.db5}
    '''
In the Snakefile workflow, I use databases that are located on my local compute cluster, meaning I didn't need to download them as part of the workflow. I'm including rules to download them below:
The input paths for the `sourmash gather` rule would need to change in the Snakefile to reflect the output paths recorded below.
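One way to make that path change in a single place is to build the database paths from a configurable directory. This is only a sketch: `DB_DIR`, its value `inputs/sourmash_dbs`, and the helper `genbank_db_paths` are hypothetical names, not part of the workflow; the file names follow the pattern of the GenBank 2022.03 databases in the rule above.

```python
from pathlib import PurePosixPath

# Hypothetical download directory; set this to wherever the databases are saved.
DB_DIR = "inputs/sourmash_dbs"

# Lineage names taken from the database file names used in the gather rule.
LINEAGES = ["bacteria", "archaea", "fungi", "protozoa", "viral"]


def genbank_db_paths(db_dir=DB_DIR, release="genbank-2022.03", ksize=31):
    """Return one database zip path per lineage under db_dir."""
    return [
        str(PurePosixPath(db_dir) / f"{release}-{lineage}-k{ksize}.zip")
        for lineage in LINEAGES
    ]
```

In the Snakefile, the five `db1` to `db5` inputs could then become a single named input such as `dbs = genbank_db_paths()`, with `{input.dbs}` in the shell command, since Snakemake expands a list input into space-separated paths.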