move input file list to common.smk

sanjaynagi commented 2 years ago

We should keep the snakefile tidy, and so have an input function for rule all, which resides in a rule file called common.smk. This is good practice in snakemake.

This function will determine which output files we want to produce, based on the config.yaml

ChabbyTMD commented 2 years ago

This makes sense. I'm still coming to grips with snakemake best practices. Does this mean that the conditional logic for the workflow i.e. whether whole genome or amplicon analysis and Bgzip will be in here as well?

ChabbyTMD commented 1 year ago

Hi Sanjay,

Please look through my edits to common.smk. I have incorporated input functions for all inputs to rule all for your review. Also find below my edits to the main workflow file. Let me know if I can open a PR for the changes.

Common.smk https://github.com/ChabbyTMD/AmpSeeker/blob/366c73224608efc353a3a5ce33cf1af7ab93e455/workflow/rules/common.smk

Snakefile https://github.com/ChabbyTMD/AmpSeeker/blob/366c73224608efc353a3a5ce33cf1af7ab93e455/workflow/Snakefile

sanjaynagi commented 1 year ago

Hey Trevor. Thanks for this! I guess I should have been a bit clearer:

I was thinking something like the getDesiredOutputs function in the following -
https://github.com/sanjaynagi/rna-seq-pop/blob/master/workflow/Snakefile https://github.com/sanjaynagi/rna-seq-pop/blob/master/workflow/rules/common.smk

So we just have one input function to rule all. And ideally this will read relevant flags from the config.yaml. So for example, if someone wants to run coverage analyses, there is a setting in the config such as:

coverage:
    activate: True

and within the function getDesiredOutputs(), you would have something like:

def getDesiredOutputs(wildcards):

  wanted_input = []

  wanted_input.extend(expand("results/vcfs/{dataset}_LSTM_merged.vcf", dataset=config['dataset']))

  if config['coverage']['activate']:
         if sequence_data == "amplicon":
           wanted_input.extend(expand("results/coverage/{sample}.per-base.bed.gz", sample=samples))
      if sequence_data == "wholegenome":
          wanted_input.extend(expand("results/wholegenome/coverage/windowed/{sample}.regions.bed.gz", sample=samples))

    return (wanted_input)

The extend function just adds on the files to the wanted_input list. For the minute there isn't many options we can put in the config, because we haven't made scripts to do with analysing the data, and we will always want to map and call genomic variants.

In general, I like to follow the practices that Johannes Koester, the author of snakemake, recommends:

https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#distribution-and-reproducibility

and if i need an example, tend to look at some of his example workflows, for example: https://github.com/snakemake-workflows/dna-seq-gatk-variant-calling/tree/main/workflow

There are so many ways one could structure a snakemake workflow, I found it useful to follow these guidelines. Although its worth noting, that not many other people do! :)

ChabbyTMD commented 1 year ago

Hey Sanjay this is clearer to me now. I'll begin work on it and keep you posted.

Also thank you for the snakemake best practices resources, these will be very valuable to me down the line.

sanjaynagi commented 1 year ago

Awesome :)

One thing to remember (which I always forget), is to update the test/config/config.yamls when you change the normal config.

ChabbyTMD commented 1 year ago

Yes, I'll be sure to keep those two up to date with each other.

Please find my modifications to common.smk, Snakefile and config down below.

common.smk https://github.com/ChabbyTMD/AmpSeeker/blob/041a17aa756cd485ba74fc95cb9dd8ae41b4b97c/workflow/rules/common.smk

Snakefile https://github.com/ChabbyTMD/AmpSeeker/blob/041a17aa756cd485ba74fc95cb9dd8ae41b4b97c/workflow/Snakefile

config.yaml https://github.com/ChabbyTMD/AmpSeeker/blob/041a17aa756cd485ba74fc95cb9dd8ae41b4b97c/config/config.yaml

sanjaynagi commented 1 year ago

Awesome Trevor. Look great. happy if you would like to make a PR.

My only comments -

# this container defines the underlying OS for each job when using the workflow
# with --use-conda --use-singularity
singularity: "docker://continuumio/miniconda3"

Please could you delete the above in common.smk? we don't need it.

import pandas as pd

configfile: "config/config.yaml"
dataset=config['dataset']
metadata = pd.read_csv(config['metadata'], sep="\t")
samples = metadata['sampleID']
sequence_data = config['sequence_data']

And could you move the above into the Snakefile? probably right at the top. Thank you!

sanjaynagi commented 1 year ago

resolved with #11

sanjaynagi / AmpSeeker

move input file list to common.smk #3