ratan-lab / mecs

Implementation of method used in Wong et al. for Error corrected DNA sequencing

Ultra-low frequency DNA mutations are confounded with technical artifacts. Unique molecular identifiers (UMIs) can be used to call these variants with confidence. However, errors introduced before UMI tagging, such as DNA polymerase errors during end repair and the first PCR cycle, cannot be corrected with the single-strand UMIs used in error-corrected sequencing and are a fundamental limitation of this method.
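To illustrate the idea, here is a minimal sketch of UMI-based consensus calling. It is not the code used in this repository: the function name `umi_consensus`, the thresholds `min_family_size` and `min_agreement`, and the toy reads are all assumptions made for the example. Reads that share a UMI are grouped into a family, and each position is decided by majority vote.

```python
from collections import Counter, defaultdict


def umi_consensus(reads, min_family_size=3, min_agreement=0.6):
    """Collapse reads that share a UMI into one consensus sequence per family.

    reads: iterable of (umi, sequence) pairs. Sequences within a family are
    assumed to have equal length and to cover the same reference positions.
    Both thresholds are illustrative values, not the pipeline's settings.
    """
    families = defaultdict(list)
    for umi, seq in reads:
        families[umi].append(seq)

    consensus = {}
    for umi, seqs in families.items():
        if len(seqs) < min_family_size:
            continue  # too few reads to out-vote a sequencing error
        bases = []
        for column in zip(*seqs):
            base, count = Counter(column).most_common(1)[0]
            # Mask the position if no base is supported by enough reads.
            bases.append(base if count / len(column) >= min_agreement else "N")
        consensus[umi] = "".join(bases)
    return consensus


reads = [
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGTACGT"),
    ("AACG", "ACGTACGA"),  # error after tagging: seen in one read, out-voted
    ("TTGC", "ACGAACGT"),
    ("TTGC", "ACGAACGT"),
    ("TTGC", "ACGAACGT"),  # error before tagging: seen in every read, kept
]
result = umi_consensus(reads)
print(result)
```

The second family shows the limitation described above: a change introduced before the UMI is attached is copied into every read of the family, so the vote cannot remove it.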

Mutation calling for Error-Corrected Sequencing

An implementation of the bioinformatics recommendations detailed at https://www.jove.com/video/57509/rare-event-detection-using-error-corrected-dna-and-rna-sequencing to call variants from error-corrected DNA sequences.

Multiple samples are sequenced to high coverage using targeted capture, and paired-end FASTQ files are generated for each sample. Each sample is analyzed separately to produce one BAM file per sample. The final SNP calls are then generated using an error profile built from all of the alignment files.

For each sample:

NOTE:

Requirements

Tools & Frameworks

  1. BWA
  2. Sambamba
  3. SAMtools
  4. PEAR (Paired-End reAd mergeR)
  5. snpEff
  6. Cromwell (https://github.com/broadinstitute/cromwell)
  7. Java
  8. Python

Additional python libraries

  1. scipy
  2. pysam
  3. numpy

R libraries

  1. tidyverse

Running a simulation test

Create a test dataset. ${reference_fasta} refers to a FASTA file of the hg19 reference sequence.

cd tests
./simulate_fragments ${reference_fasta} > fragments.fa
./simulate_pe_reads
gzip read_1.fq
gzip read_2.fq

Now run the pipeline on the simulated dataset after setting the values in the input JSON file. The value of process_samples.inputs in the JSON file should be the path to a file with four columns: sample name, library name, absolute path to the first read file, and absolute path to the second read file. The variable ${cromwell} should point to the Cromwell JAR (https://github.com/broadinstitute/cromwell). An example configuration file is included, but it should be modified so that it points to the correct files and directories for the user.

java -Xmx4g -Dconfig.file=../src/local.conf -jar ${cromwell} run ../src/process_samples.wdl --inputs process_samples.json
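The samples file and the process_samples.inputs entry could be prepared with a short script such as the sketch below. Several details are assumptions rather than documented behavior: the tab delimiter, the absence of a header row, the file name `samples.tsv`, and the helper name `write_inputs`. Any other keys already present in the JSON file are left untouched.

```python
import json
from pathlib import Path


def write_inputs(samples, workdir=".", json_name="process_samples.json"):
    """Write the four-column samples file and point the input JSON at it.

    samples: iterable of (sample_name, library_name, read1_path, read2_path).
    Assumes a tab-delimited file without a header; adjust if the workflow
    expects a different layout.
    """
    workdir = Path(workdir)
    samples_file = workdir / "samples.tsv"  # hypothetical file name
    with samples_file.open("w") as out:
        for name, library, read1, read2 in samples:
            # The README asks for absolute paths to both read files.
            row = [name, library,
                   str(Path(read1).resolve()), str(Path(read2).resolve())]
            out.write("\t".join(row) + "\n")

    json_file = workdir / json_name
    # Keep whatever else is already configured in the JSON file.
    inputs = json.loads(json_file.read_text()) if json_file.exists() else {}
    inputs["process_samples.inputs"] = str(samples_file.resolve())
    json_file.write_text(json.dumps(inputs, indent=2) + "\n")
    return samples_file, json_file
```

For the simulated dataset, a call could look like `write_inputs([("sim", "lib1", "read_1.fq.gz", "read_2.fq.gz")])`, run from the tests directory; the sample and library names here are placeholders.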

The output file of the workflow is named 'variants.ann.txt' and can be found in the cromwell-executions folder. Let's make a soft link to it.

ln -s `find . -name "variants.ann.txt" ` .
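If the workflow has been run more than once, the find command returns several paths and ln will fail. A possible alternative, sketched here with a hypothetical helper `link_latest_output`, is to link only the most recently modified copy. It assumes the newest file is the one you want.

```python
from pathlib import Path


def link_latest_output(root=".", name="variants.ann.txt"):
    """Symlink the newest file called `name` found below `root` into `root`."""
    root = Path(root)
    # Ignore the link itself (from an earlier call) and any other symlinks.
    candidates = [
        p for p in root.rglob(name)
        if p.parent != root and not p.is_symlink()
    ]
    if not candidates:
        raise FileNotFoundError(f"no {name} found below {root}")

    latest = max(candidates, key=lambda p: p.stat().st_mtime)
    link = root / name
    if link.is_symlink():
        link.unlink()  # replace a stale link from a previous run
    link.symlink_to(latest.resolve())
    return latest
```

Calling `link_latest_output()` from the tests directory mirrors the ln command above. It raises an error rather than overwriting if a regular file with that name already exists there.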

Let's create a simple plot that shows the sequencing depth and VAF of the mutations in this dataset, and whether we found them or not.

./plot_mutations.R

Take a look at the file assessment.pdf. Most of the variants we miss should be the ones with extremely low VAF.
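A rough way to see why the lowest-VAF variants are the ones most likely to be missed is to treat the number of variant-supporting reads as binomial in the depth and the VAF. The sketch below is only a back-of-the-envelope model, not the caller's statistics: the depth of 2000 and the requirement of 3 supporting reads are made-up values for illustration.

```python
from math import comb


def prob_at_least(depth, vaf, min_alt_reads):
    """P(X >= min_alt_reads) for X ~ Binomial(depth, vaf)."""
    below = sum(
        comb(depth, k) * vaf**k * (1 - vaf) ** (depth - k)
        for k in range(min_alt_reads)
    )
    return 1.0 - below


# Hypothetical depth and read threshold, chosen only to show the trend.
for vaf in (0.0001, 0.001, 0.01):
    p = prob_at_least(depth=2000, vaf=vaf, min_alt_reads=3)
    print(f"VAF={vaf}: P(at least 3 supporting reads at depth 2000) = {p:.3f}")
```

Under this model the chance of observing enough supporting reads falls sharply as the VAF approaches 1/depth, which matches the pattern expected in assessment.pdf.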