The QBRC mutation calling pipeline is a flexible and comprehensive pipeline for mutation calling that has glued together a lot of commonly used software and data processing steps for mutation calling. The mutation calling software include: sambamba, speedseq, varscan, shimmer, strelka, manta, lofreq_tar. It identifies somatic and germline variants from whole exome sequencing (WXS), RNA sequencing and deep sequencing data. It can be used for human, PDX, and mouse data (fastq files or bam files as input). \ Please refer to the lab website of Dr. Tao Wang, https://qbrc.swmed.edu/labs/wanglab/index.php, for more information.
For a paird of 'fastq.gz'files of 200M, it takes around 2 hours to finish somatic mutation calling.
64 bit linux operating system
BWA (version >=0.7.15)
STAR (required if applied for RNA sequencing data)
sambamba
speedseq
varscan
samtools (version >=1.6)
shimmer
annovar (database downloaded in default folder: refGene,ljb26_all,cosmic70,esp6500siv2_all,exac03,1000g2015aug)
python2
strelka (version >=2.8.3, note: strelka is tuned to run exome sequencing or RNA sequencing)
manta (version >=1.4.0)
java (version 1.8)
perl (Parallel::ForkManager)
lofreq_star (version >=2.1.3, for tumor-only calling)
bowtie2 (version>= 2.3.4.3, for Patient Derived Xenograft models) \
picard.jar (please download the file https://drive.google.com/file/d/1lL_vUgrY6VAtjG87bf9PXgYubuYd-st2/view?usp=sharing and place it under the folder named "somatic_script" before running the pipeline)
Input can be fastq files or bam files or a mixture of fastq and bam files.
The code for somatic and germline mutation calling for a pair of normal and tumor sequencing files.
perl /Path/to/somatic.pl <normal_fastq1> <normal_fastq2/NA> <tumor_fastq1> <tumor_fastq2/NA> <thread> <build> <index> <java17> </Path/to/output> <pdx> <disambiguate_pipeline>
perl ~/somatic/somatic.pl ~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz 32 hg38 ~/ref/hg38/hs38d1.fa /cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java ~/somatic_result/1799-01/ human 1 ~/disambiguate_pipeline
Input seuqencing files:
(1) If input are fastq files, they must be 'gz' files. 'sequencing_file_1', 'sequencing_file_2' are path to fastq1 and fastq2 of control sample; 'sequencing_file_3', 'sequencing_file_4' are path to fastq1 and fastq2 of sample of interest.
(2) If input are bam files, use "bam /path/to/bam/files.bam" in replace of the tow corresponding fastq input files.
(3) If input are RNA sequencing files, use "RNA:fastq1" or "RNA:bam" at the first or third slot.
(4) If input are deep exome sequencing data, use "Deep:fastq1" at the first or third slot.
(5) For tumor-only calling, put "NA NA" in the first two slots. Results will be written to germline output files.
(6) Optional: run somatic_script/SurecallTrummer.jar on the fastq files before runnign somatic.pl for deep seuquencing files.
(7) If only single end fastq data are available, put the fastq file(s) at the first and/or the third slots, then put NA in the second and/or fourth slot.
Slurm wrapper for somatic.pl for a batch of sampels and it is easy to change for other job scheduler system by revising this line of code: "system("sbatch ".$job)" and using proper demo job submission shell script.
perl /Directory/to/folder/of/code/job_somatic.pl design.txt example_file thread build index java17 disambiguate_pipeline
(5 columns; columns seperated by tab):
~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz ~/out/1799-01/ human
~/seq/1799-02N.R1.fastq.gz ~/seq/1799-02N.R2.fastq.gz ~/seq/1799-02T.R1.fastq.gz ~/seq/1799-02T.R2.fastq.gz ~/out/1799-02/ human
~/seq/1799-03N.R1.fastq.gz ~/seq/1799-03N.R2.fastq.gz ~/seq/1799-03T.R1.fastq.gz ~/seq/1799-03T.R2.fastq.gz ~/out/1799-03/ human
perl ~/somatic/job_somatic.pl somatic_design.txt ~/somatic/example/example.sh 32 hg38 ~/ref/hg38/hs38d1.fa /cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java 0 2 ~/disambiguate_pipeline
Post-processing script for somatic mutations for a batch of sampels.
Rscript filter.R design.txt output build index VAF_cutoff filter
sample_id patient_id folder 1799-01 pat-01 ~/filter/1799-01/ 1799-02 pat-02 ~/filter/1799-02/ 1799-03 pat-03 ~/filter/1799-03/
Rscript ~/somatic/filter.R filter_design.txt ~/filter/ hg38 ~/ref/hg38/hs38d1.fa 0.01 FALSE
Pipeline for somatic copy number variation calling and quality check for each sample
perl cnv.pl
sequencing_file_1
sequencing_file_2
sequencing_file_3
sequencing_file_4
thread index somatic_mutation_result output
perl ~/somatic/cnv.pl ~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz 32 ~/ref/hg38/hs38d1.fa ~/somatic_result/1799-01/somatic_mutation_hg38.txt ~/cnv_result/1799-01
Slurm wrapper for cnv.pl for a batch of samples and it is easy to change for other job scheduler system by revising this line of code: "system("sbatch ".$job)" and using proper demo job submission shell script.
perl job_cnv.pl design.txt example.sh thread index
(6 columns; columns seperated by tab):
~/seq/1799-01N.R1.fastq.gz ~/seq/1799-01N.R2.fastq.gz ~/seq/1799-01T.R1.fastq.gz ~/seq/1799-01T.R2.fastq.gz ~/somatic_result/1799-01/somatic_mutations_hg38.txt ~/cnv_result/1799-01/
~/seq/1799-02N.R1.fastq.gz ~/seq/1799-02N.R2.fastq.gz ~/seq/1799-02T.R1.fastq.gz ~/seq/1799-02T.R2.fastq.gz ~/somatic_result/1799-02/somatic_mutations_hg38.txt ~/cnv_result/1799-02/
~/seq/1799-03N.R1.fastq.gz ~/seq/1799-03N.R2.fastq.gz ~/seq/1799-03T.R1.fastq.gz ~/seq/1799-03T.R2.fastq.gz ~/somatic_result/1799-03/somatic_mutations_hg38.txt ~/cnv_result/1799-03/
perl ~/somatic/job_cnv.pl cnv_design.txt ~/somatic/example/example.sh 32 ~/ref/hg38/hs38d1.fa 2
Summarizing script for CNV and quality check callings for a batch of samples.
Rscript summarize_cnv.R design.txt output index
(2 columns; columns seperated by tab; header):
sample_id folder
1799-01 ~/cnv_result/1799-01
1799-02 ~/cnv_result/1799-02
1799-03 ~/cnv_result/1799-03
Rscript ~/somatic/summarize_cnv.R cnv_sum_design.txt ~/cnv_sum/ ~/ref/hg38/