Ultra-low frequency DNA mutations are confounded with technical artifacts. Unique molecular identifiers (UMIs) can be used to call these variants with confidence. However, errors before UMI tagging, such as DNA polymerase errors during end repair and the first PCR cycle cannot be corrected with single-strand UMIs as used in Error-corrected sequencing and are a fundamental limitation to this method.
An implementation of the bioinformatics recommendations detailed at https://www.jove.com/video/57509/rare-event-detection-using-error-corrected-dna-and-rna-sequencing to call variants from error-corrected DNA sequences.
Multiple samples are sequenced to high coverage using targeted capture, and paired-end fastq files are generated for each of them. The following analysis is done on each one of them to generate a BAM file for each of them. The final SNP calls are generated based on the error-profile that is based on all the alignment files.
For each sample:
Use binomial distribution to call single nucleotide variants (SNPs) in retained data from Step 2.5.7 with the following parameters. The binomial statistic will be based on a genomic position-specific error model. Each genomic position is modeled independently after summing out the error rates of all samples for that particular position. Following the example:
Probability of nucleotide profile at a given genomic position, p ∑ Variant RF2 ∑ Total RFs = 26/255505 = 0.000101759 Binomial probability of 24 variant RFs out of 35911 total RFs, P(X ≥ x) in sample K = 1 - binomial(24, 35911, 0.000101759) = 2.26485E-13
NOTE:
For each genomic position queried, there would be three possible mutational changes (i.e.,A>T, A>C, A>G), and each of which would be represented as background artifact. Somatic events that are significantly different from the background after Bonferroni correction are retained. In the example shown in Table 1, the number of tests performed was 11, hence a Bonferroni corrected p-value ≤0.00454545 (0.05/11) was required to call an event as statistically significant.
Somatic events are required to be present in both replicates from the same specimen; otherwise, regard them as false positives.
Create a test dataset. ${reference_fasta} refers to a fasta sequence of hg19.
cd tests
./simulate_fragments ${reference_fasta} > fragments.fa
./simulate_pe_reads
gzip read_1.fq
gzip read_2.fq
Now run the pipeline on the simulated dataset after setting the values in the input json file. The value corresponding to process_samples.inputs in the JSON fiel should be a file with the columns that refer to the sample name, library name, absolute path to the first read file, and absolute path to the second read file. The variable ${cromwell} should point to the JAR for cromwell (https://github.com/broadinstitute/cromwell). An example configuration file is included, but should be modified so that it point to the correct files and directories for the user.
java -Xmx4g -Dconfig.file=../src/local.conf -jar ${cromwell} run ../src/process_samples.wdl --inputs process_samples.json
The output file of the workflow is named 'variants.ann.txt' and can be found in the cromwell-executions folder. Lets make a soft link to it.
ln -s `find . -name "variants.ann.txt" ` .
Lets create a simple plot which show the sequencing depth and VAF of the mutations in this dataset, and whether we found them or not.
./plot_mutations.R
Take a look at the file assessment.pdf. Most of the variants we miss should be the ones with extremely low VAF.