This code supports a series of scientific papers. It is released for transparency, as an aid in understanding the work. The code for specific papers has been or will be archived at Zenodo.org via separate GitHub repositories. If someone can reuse the code (after adjusting paths) that is great, but that is not the main goal here.
The Slurm job scripts
are numbered in the order
in which they are to be run,
i.e. submitted with sbatch
(010, 020, 030_, ...).
The workflow is similar
to the one described
on the [[http://www.htslib.org/workflow/wgs-call.html][Samtools workflow page]],
but combines steps with pipes
to reduce repeated disk read/write steps.
Several parts of each script
that need to be edited
are indicated
(UPDATE_HERE).
Other parameters
(such as 'partition=main')
might need to be adjusted
on other clusters
or to run jobs that are similar
but more resource-intensive.
Indexing of reference sequences
needs to be done just once (015).
The other steps run in parallel,
using a Slurm job array,
with one job per library.
We trim reads with Cutadapt (010),
align them with BWA MEM (020),
and do the processing needed
for calling variants with VarScan (030).
These steps are split into multiple job scripts
mainly because the BWA MEM step
benefits more from allocating multiple CPU cores
than the others do.
Variables are defined
for the location of the input FastQ files
(FASTQDIR)
and the trimmed read FastQ files
generated from them
(CUTADAPTDIR).
ls
is used (with sed
)
to get prefixes from the names of those input files
that are then then used for subsequent output files.
Output from the 020 and 030 scripts
is written to a third directory
(OUTDIR).
Job scripts were run on a cluster on which Lmod is used to manage software. If you are not using Lmod you'll need the relevant binaries (cutadapt, bwa, fastqc, ...) to be in your $PATH.
Some lines in each job script (pwd, date, module list) are included purely to provide context in the log files that Slurm writes to disk (STDERR and STDOUT, in one or two files).