sanjaynagi closed 1 year ago
This makes sense. I'm still coming to grips with snakemake best practices. Does this mean that the conditional logic for the workflow, i.e. whether to run whole genome or amplicon analysis and Bgzip, will be in here as well?
Hi Sanjay,
Please look through my edits to common.smk. I have incorporated input functions for all inputs to rule all for your review. Also find below my edits to the main workflow file. Let me know if I can open a PR for the changes.
Hey Trevor. Thanks for this! I guess I should have been a bit clearer:
I was thinking something like the getDesiredOutputs function in the following -
https://github.com/sanjaynagi/rna-seq-pop/blob/master/workflow/Snakefile
https://github.com/sanjaynagi/rna-seq-pop/blob/master/workflow/rules/common.smk
So we just have one input function to rule all. And ideally this will read relevant flags from the config.yaml. So for example, if someone wants to run coverage analyses, there is a setting in the config such as:
coverage:
  activate: True
and within the function getDesiredOutputs(), you would have something like:
def getDesiredOutputs(wildcards):
    wanted_input = []
    wanted_input.extend(expand("results/vcfs/{dataset}_LSTM_merged.vcf", dataset=config['dataset']))
    if config['coverage']['activate']:
        if sequence_data == "amplicon":
            wanted_input.extend(expand("results/coverage/{sample}.per-base.bed.gz", sample=samples))
        if sequence_data == "wholegenome":
            wanted_input.extend(expand("results/wholegenome/coverage/windowed/{sample}.regions.bed.gz", sample=samples))
    return wanted_input
The extend function just adds the files to the wanted_input list. For the minute there aren't many options we can put in the config, because we haven't written any scripts for analysing the data, and we will always want to map and call genomic variants.
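To make the idea easy to try outside a workflow, here is a minimal plain-Python sketch of the same selection logic. It is only an illustration under a few assumptions: `expand_paths` is a hypothetical stand-in for Snakemake's `expand()` that handles a single wildcard, `get_desired_outputs` takes the config and sample list as arguments rather than reading globals, and the example config values and sample names are made up. In the real workflow the function would be passed as the input to rule all.

```python
# Plain-Python sketch of the output-selection logic; no Snakemake required.
# `expand_paths` is a hypothetical stand-in for Snakemake's expand(),
# limited to a single wildcard.


def expand_paths(pattern, **wildcards):
    """Fill the one wildcard in `pattern` with each supplied value."""
    name, values = next(iter(wildcards.items()))
    if isinstance(values, str):
        values = [values]
    return [pattern.format(**{name: value}) for value in values]


def get_desired_outputs(config, samples):
    """Return the list of files that rule all should request."""
    wanted_input = []

    # Mapping and variant calling always run, so this output is unconditional.
    wanted_input.extend(
        expand_paths("results/vcfs/{dataset}_LSTM_merged.vcf", dataset=config["dataset"])
    )

    # Optional analyses are switched on by flags in the config.
    # A missing `coverage` section is treated as "off" (an assumption).
    if config.get("coverage", {}).get("activate", False):
        if config["sequence_data"] == "amplicon":
            wanted_input.extend(
                expand_paths("results/coverage/{sample}.per-base.bed.gz", sample=samples)
            )
        elif config["sequence_data"] == "wholegenome":
            wanted_input.extend(
                expand_paths(
                    "results/wholegenome/coverage/windowed/{sample}.regions.bed.gz",
                    sample=samples,
                )
            )
    return wanted_input


# Example values are made up for illustration.
example_config = {
    "dataset": "demo",
    "sequence_data": "amplicon",
    "coverage": {"activate": True},
}
example_outputs = get_desired_outputs(example_config, ["sampleA", "sampleB"])
print(example_outputs)
```

Setting `activate` to False in the example config leaves only the merged VCF in the returned list, which is the behaviour we want from a config-driven rule all.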
In general, I like to follow the practices that Johannes Koester, the author of snakemake, recommends:
https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#distribution-and-reproducibility
and if I need an example, I tend to look at some of his example workflows, for example: https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/tree/main/workflow
There are so many ways one could structure a snakemake workflow that I found it useful to follow these guidelines. Although it's worth noting that not many other people do! :)
Hey Sanjay this is clearer to me now. I'll begin work on it and keep you posted.
Also thank you for the snakemake best practices resources, these will be very valuable to me down the line.
Awesome :)
One thing to remember (which I always forget) is to update the test/config/config.yamls when you change the normal config.
Yes, I'll be sure to keep those two up to date with each other.
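One possible way to catch drift between the two configs is a small key comparison. The sketch below is not part of the workflow; `missing_keys` is a hypothetical helper, and it assumes both configs have already been loaded into Python dictionaries (for example with a YAML parser). The dictionaries shown are made-up examples.

```python
def missing_keys(reference, other, prefix=""):
    """List dotted key paths present in `reference` but absent from `other`."""
    missing = []
    for key, value in reference.items():
        path = f"{prefix}{key}"
        if key not in other:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(other[key], dict):
            # Recurse into nested sections such as `coverage`.
            missing.extend(missing_keys(value, other[key], prefix=f"{path}."))
    return missing


# Made-up example: the main config gained a `coverage` section,
# but the test config was not updated.
main_config = {
    "dataset": "demo",
    "sequence_data": "amplicon",
    "coverage": {"activate": True},
}
test_config = {"dataset": "test", "sequence_data": "amplicon"}

print(missing_keys(main_config, test_config))
```

An empty list means every key in the main config also exists in the test config; values are deliberately not compared, since the test config is expected to point at different data.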
Please find my modifications to common.smk, Snakefile and config down below.
config.yaml https://github.com/ChabbyTMD/AmpSeeker/blob/041a17aa756cd485ba74fc95cb9dd8ae41b4b97c/config/config.yaml
Awesome, Trevor. Looks great. Happy for you to make a PR if you would like.
My only comments -
# this container defines the underlying OS for each job when using the workflow
# with --use-conda --use-singularity
singularity: "docker://continuumio/miniconda3"
Please could you delete the above in common.smk? We don't need it.
import pandas as pd
configfile: "config/config.yaml"
dataset=config['dataset']
metadata = pd.read_csv(config['metadata'], sep="\t")
samples = metadata['sampleID']
sequence_data = config['sequence_data']
And could you move the above into the Snakefile? Probably right at the top. Thank you!
resolved with #11
We should keep the snakefile tidy, and so have an input function for rule all, which resides in a rule file called common.smk. This is good practice in snakemake.
This function will determine which output files we want to produce, based on the config.yaml.