`snakemake`: A tool for automating and streamlining your analyses

Created by Stacy Li for the Center for Computational Biology at UC Berkeley.

This repository contains a sample configuration and workflow for learning `snakemake`. I've tried to make this a minimally fussy example of how `snakemake` works, and of all the good it does for me in my work.

These materials were originally prepared for a live one-hour workshop. You can run the tutorial on your own by watching the recording, or by following the slides and commands provided here.

| Resource | Description |
|---|---|
| Recording | Available here. |
| Slides | Included with the repo, viewable by web here. Note that these have been edited to address errors caught during the live workshop. |
| Commands | See the commands section below. |

If you have any issues with the materials, please submit an issue via GitHub. I plan on actively maintaining this resource until I depart Berkeley.

You need `conda` (Anaconda distribution, `conda`, or `miniconda`) or a `conda`-like (`mamba`, `micromamba`, etc.) package manager installed to run this tutorial. That's it! All other packages will be installed into their own isolated environments as you go along.
Don't have this yet? Click here and select the appropriate installer, or use one of the DataHub setup options below.
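
If you're not sure whether one of these tools is already on your `PATH`, a quick check along the following lines can help. This is a minimal sketch rather than part of the workshop materials: the helper name `find_conda_like` is hypothetical, and it only looks for the three tool names mentioned above.

```shell
#!/usr/bin/env bash
# Print the first conda-like package manager found on PATH.
# `find_conda_like` is a hypothetical helper name, not part of conda or snakemake.
find_conda_like() {
    local tool
    for tool in conda mamba micromamba; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            return 0
        fi
    done
    return 1
}

if manager=$(find_conda_like); then
    echo "Found package manager: $manager"
else
    echo "No conda-like package manager found; install one before continuing."
fi
```
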
This section refers to the setup commands used during the live (1 hour) workshop. If you are working through this on your own, please go to the independent setup section (#independent-setup).
The setup below pre-populates the `conda` environments you'll need for the workshop: this will save us precious time while we go over the lecture.
Use `git clone` or download a copy of this repository to a suitable location. This will be your working directory.
conda env create --file envs/snakemake_fa23.yml
conda activate snakemake_fa23
snakemake --cores all --use-conda --conda-create-envs-only output/visuals/vcf_heatmap.pdf
The DataHub option is UC Berkeley only: you need a CalNet ID to proceed.
First, click this link and proceed through the various pages until you launch your JupyterHub instance. If everything went well, you should have a cloned `snakemake_fa23` folder.
Open up a terminal instance, then run the commands below:
cd snakemake_fa23
conda env create --file envs/snakemake_fa23.yml
bash
conda activate snakemake_fa23
snakemake --cores all --use-conda --conda-create-envs-only output/visuals/vcf_heatmap.pdf
This section refers to setup commands for independently working through the materials. `conda` environments will be generated on the fly as you run the workflow.
Use `git clone` or download a copy of this repository to a suitable location. This will be your working directory.
conda env create --file envs/snakemake_fa23.yml
conda activate snakemake_fa23
The DataHub option is UC Berkeley only: you need a CalNet ID to proceed.
First, click this link and proceed through the various pages until you launch your JupyterHub instance. If everything went well, you should have a cloned `snakemake_fa23` folder.
Open up a terminal instance, then run the commands below:
cd snakemake_fa23
conda env create --file envs/snakemake_fa23.yml
bash
conda activate snakemake_fa23
This is a list of commands from the live workshop. All of the rules are documented with comments as well. All examples below will run the workflow to generate the `output/visuals/vcf_heatmap.pdf` target file.
To do a dry run of the workflow, where `n` is the maximum number of jobs to run in parallel:
snakemake -pj{n} --use-conda output/visuals/vcf_heatmap.pdf -np
For example, to run with 10 maximum jobs:
snakemake -pj10 --use-conda output/visuals/vcf_heatmap.pdf -np
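
One way to avoid retyping the job count is to keep it in a shell variable. The sketch below only assembles and prints the dry-run command so you can inspect it before running anything. The variable names `n`, `target`, and `cmd` are illustrative and are not part of the workshop materials.

```shell
#!/usr/bin/env bash
# Maximum number of parallel jobs; replace 10 with whatever suits your machine.
n=10
target="output/visuals/vcf_heatmap.pdf"

# Assemble the dry-run command (-n = dry run, -p = print shell commands, -j = max jobs).
cmd="snakemake -pj${n} --use-conda ${target} -np"

# Print the command rather than executing it, so nothing runs by accident.
echo "$cmd"
```

Once the printed line looks right, paste it into your terminal to perform the dry run.
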
To perform a real run of the workflow:
snakemake -pj{n} --use-conda output/visuals/vcf_heatmap.pdf
To create a visualization of the rule graph:
snakemake --rulegraph --use-conda output/visuals/vcf_heatmap.pdf | dot -Tpng > output/visuals/workflow_rulegraph.png
To create a visualization of the full DAG (directed acyclic graph of every sample processed):
snakemake --dag --use-conda output/visuals/vcf_heatmap.pdf | dot -Tpng > output/visuals/workflow_dag.png
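
Both graph commands pipe into `dot`, which ships with Graphviz, and redirect into `output/visuals/`. The shell opens the redirect target before the pipeline starts, so that directory needs to exist already. A small pre-flight check, offered as a sketch (the helper name `check_graph_prereqs` is hypothetical), might look like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm graph rendering can succeed before calling snakemake.
check_graph_prereqs() {
    local out_dir="$1"
    # The redirect target's directory must exist, so create it if needed.
    mkdir -p "$out_dir" || return 1
    # `dot` is the Graphviz renderer used in the commands above.
    if ! command -v dot >/dev/null 2>&1; then
        echo "dot not found; install Graphviz to render the graphs." >&2
        return 1
    fi
    echo "Ready to render graphs into $out_dir"
}

check_graph_prereqs "output/visuals" || echo "Graph prerequisites are not met yet."
```
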
Once you feel comfortable with how the above commands work, I recommend trying out a few more things:

- Switch from `config/subset.tsv` to `config/all_samples.tsv`. How can you evaluate what jobs might need to be run to update the vcf heatmap, without running the actual jobs?
- Edit `rule all` in the `Snakefile` to ensure that both the coverage plot of your choice (`.html` or `.pdf`) and the vcf heatmap are always up to date.

The short-read Fructilactobacillus sanfranciscensis data used in this workshop is from Rogalski et al. 2020. The `vcfR` heatmap plotting script is a modified version of a script from Olawoye et al. 2020.

Thank you to my wonderful research group, The Sudmant Lab. As always, I am especially grateful for my advisor Dr. Peter Sudmant and mentor Dr. Juan Manuel Vazquez. Everything I do is only possible because I stand upon the shoulders of giants.

If you'd like to stay in touch, please feel free to connect with me using any of the platforms here. If you're local, come and say hi at the next CCB event.