jshoyer / circular-ssDNA-virus-subconsensus-variant-analysis

Code supporting papers on deep sequencing of begomovirus populations
MIT License

This code supports a series of scientific papers. It is released for transparency, as an aid to understanding the work. The code for specific papers has been or will be archived at Zenodo.org via separate GitHub repositories. If someone can reuse the code (after adjusting paths), that is great, but reuse is not the main goal here.

The Slurm job scripts are numbered in the order in which they are to be run, i.e. submitted with sbatch (010, 020, 030, ...). The workflow is similar to the one described on the [[http://www.htslib.org/workflow/wgs-call.html][Samtools workflow page]], but combines steps with pipes to avoid repeated disk reads and writes. Parts of each script that need to be edited are marked UPDATE_HERE. Other parameters (such as 'partition=main') might need to be adjusted on other clusters, or to run similar but more resource-intensive jobs.
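For orientation, here is a minimal sketch of the kind of header such a script might carry; the script file name, walltime, and memory values below are placeholders, not the repository's actual settings:

#+begin_src sh
#!/bin/bash
#SBATCH --partition=main    # UPDATE_HERE: partition names vary by cluster
#SBATCH --time=02:00:00     # placeholder walltime; increase for heavier jobs
#SBATCH --mem=8G            # placeholder memory request

# Scripts are submitted in numbered order, e.g. (hypothetical file name):
#   sbatch 010_trim-reads.sh
#+end_src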

Indexing of reference sequences needs to be done just once (015). The other steps run in parallel, using a Slurm job array, with one job per library. We trim reads with Cutadapt (010), align them with BWA MEM (020), and do the processing needed for calling variants with VarScan (030). These steps are split into multiple job scripts mainly because the BWA MEM step benefits more from multiple CPU cores than the others do. Variables are defined for the location of the input FastQ files (FASTQDIR) and the trimmed-read FastQ files generated from them (CUTADAPTDIR). ls is used (with sed) to derive prefixes from the names of those input files, and the prefixes are then used to name subsequent output files. Output from the 020 and 030 scripts is written to a third directory (OUTDIR).
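The sketch below illustrates the job-array pattern under stated assumptions: the directory variables match those named above, but the array size, file-naming scheme, sed expression, paired-end layout, and reference file name are guesses for illustration, not the repository's actual values:

#+begin_src sh
#!/bin/bash
#SBATCH --array=1-24          # assumption: one array task per library
#SBATCH --cpus-per-task=8     # BWA MEM benefits from extra cores

FASTQDIR=/path/to/raw/fastq   # UPDATE_HERE
CUTADAPTDIR=/path/to/trimmed  # UPDATE_HERE
OUTDIR=/path/to/output        # UPDATE_HERE

# Derive a per-library prefix from the input file names;
# the sed expression depends on the actual naming scheme.
PREFIX=$(ls "$FASTQDIR"/*_R1*.fastq.gz \
           | sed 's/.*\///; s/_R1.*//' \
           | sed -n "${SLURM_ARRAY_TASK_ID}p")

# Align (020) and sort in one pipe, avoiding an intermediate SAM file.
# The reference must already have been indexed (the 015 step).
bwa mem -t "$SLURM_CPUS_PER_TASK" ref.fasta \
    "$CUTADAPTDIR/${PREFIX}_R1.trimmed.fastq.gz" \
    "$CUTADAPTDIR/${PREFIX}_R2.trimmed.fastq.gz" \
  | samtools sort -o "$OUTDIR/${PREFIX}.sorted.bam" -
#+end_src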

The job scripts were run on a cluster on which Lmod is used to manage software. If you are not using Lmod, you'll need the relevant binaries (cutadapt, bwa, fastqc, ...) to be in your $PATH.
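For example (the module names below are illustrative; the exact modules and versions depend on the cluster):

#+begin_src sh
# With Lmod:
module load cutadapt bwa samtools

# Without Lmod, put the binaries on your PATH instead, e.g.:
export PATH=/opt/bioinformatics/bin:$PATH   # hypothetical path
#+end_src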

Some lines in each job script (pwd, date, module list) are included purely to provide context in the log files that Slurm writes to disk (stderr and stdout, combined into one file or split into two).
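In a script, those context-logging lines are simply:

#+begin_src sh
# Record context at the top of the job, purely for the Slurm log files:
pwd
date
module list
#+end_src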