humburg / pirates

Improving the quality of deep sequencing data
MIT License

Eliminating errors from high-throughput sequencing data

This project was part of HealthHack 2016. The original solution developed there is still available at https://github.com/HealthHackAu2016/pirates.

The problem

Sequencing of entire human genomes is now readily available and this technology is used extensively to study the contribution of genetic variation to a wide variety of diseases. The identification of inherited genetic variants carried by an individual has become a relatively routine task.

However, not all genetic variation is inherited. Each individual accumulates a large number of mutations throughout their life, with different cells carrying different sets of mutations. While most of these have no noticeable impact, some cause disease. Since these mutations are only present in a small fraction of cells, they are much harder to identify. More data is needed to observe these potentially very rare mutations, and errors in the data further limit our ability to detect disease-causing mutations.

The aim of this project is to post-process high-throughput sequencing data to identify and correct errors in the sequences. This will improve the signal-to-noise ratio and lower the frequency at which mutations can be reliably detected. Ultimately, this will not only lead to a better understanding of disease but also has the potential to enable better-targeted treatments for patients.

Available data

A high-throughput, paired-end deep sequencing dataset from an experiment targeting 163 genes is available for testing as a FASTQ file (3 GB download). A small subset consisting of the first 10,000 reads is also available (1.8 MB download). Alternatively, both datasets are available as data volumes from Docker Hub as humburg/jurkat-only-rna-assembled and humburg/test-rna-assembled, respectively.

Overlapping reads from each pair were merged into a single sequence with PEAR (v0.9.10). Each end of the resulting sequence contains an eight-base barcode. Together, these 16 bases form a unique molecular identifier (UID). The barcode on each end is followed by four constant bases (GACT/AGTC).
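To make the read layout concrete, here is a minimal Python sketch of how the UID could be separated from the rest of a merged read. It is an illustration, not the project's code. It assumes the layout is barcode, GACT, target sequence, AGTC, barcode, read left to right, which is one interpretation of the description above. The function name split_read and the constants are hypothetical. Requiring an exact match on the constant bases is a simplification, since sequencing errors can occur there too.

```python
# Assumed layout of a merged read (an interpretation, not confirmed by the project):
#   [8-base barcode][GACT][target sequence][AGTC][8-base barcode]
BARCODE_LEN = 8
LEFT_SPACER = "GACT"
RIGHT_SPACER = "AGTC"


def split_read(seq, qual):
    """Return (uid, insert_seq, insert_qual), or None if the layout does not match."""
    flank = BARCODE_LEN + len(LEFT_SPACER)
    # The read must hold both flanks, and every base needs a quality character.
    if len(seq) != len(qual) or len(seq) < 2 * flank:
        return None
    # Check the constant bases that follow the barcode on each end.
    if seq[BARCODE_LEN:flank] != LEFT_SPACER:
        return None
    if seq[-flank:-BARCODE_LEN] != RIGHT_SPACER:
        return None
    # The two barcodes together form the 16-base UID.
    uid = seq[:BARCODE_LEN] + seq[-BARCODE_LEN:]
    return uid, seq[flank:-flank], qual[flank:-flank]
```

Reads for which split_read returns None would need separate handling, for example tolerating a mismatch in the constant bases.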

The solution

Our algorithm processes these reads (UID, genetic sequence, and quality information) to remove errors introduced by the sequencer. Errors can occur in any of these components. We start by matching the UIDs of the reads to form groups (clusters) that share the same UID. If two reads have the same UID, we form a consensus from their sequences: we compare the new sequence to the reference sequence for that group and keep the higher-quality character from either sequence. This leaves us with consensus reads, each summarising many individual reads, and singleton reads. We then compare the singletons to the consensus groups using a similar approach.
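The grouping and consensus steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the project's implementation: reads are given as (uid, sequence, quality) tuples, reads in a group have equal length so no alignment is needed, and quality strings use the usual FASTQ ASCII encoding, so comparing characters compares quality scores. The names merge_pair and build_consensus are hypothetical, and matching singletons to consensus groups is not covered.

```python
from collections import defaultdict


def merge_pair(seq_a, qual_a, seq_b, qual_b):
    """At each position, keep the base with the higher quality character."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only merges reads of equal length")
    seq, qual = [], []
    for base_a, q_a, base_b, q_b in zip(seq_a, qual_a, seq_b, qual_b):
        # On a tie, the base from the first (reference) read is kept.
        if q_b > q_a:
            seq.append(base_b)
            qual.append(q_b)
        else:
            seq.append(base_a)
            qual.append(q_a)
    return "".join(seq), "".join(qual)


def build_consensus(reads):
    """Group reads by UID; return (consensus, singletons) dicts keyed by UID."""
    groups = defaultdict(list)
    for uid, seq, qual in reads:
        groups[uid].append((seq, qual))

    consensus, singletons = {}, {}
    for uid, members in groups.items():
        # The first read acts as the reference; later reads are folded into it.
        seq, qual = members[0]
        for other_seq, other_qual in members[1:]:
            seq, qual = merge_pair(seq, qual, other_seq, other_qual)
        target = singletons if len(members) == 1 else consensus
        target[uid] = (seq, qual)
    return consensus, singletons
```

For example, two reads with UID "U1", ("ACGT", "IIII") and ("ACCT", "II#I"), disagree at the third base. The first read has the higher quality there, so the consensus keeps G.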