rouskinlab / seismic-rna

https://rouskinlab.github.io/seismic-rna/
GNU General Public License v3.0

Correct mutation rates for observer bias while relaxing assumption that reads span the full section #10

Closed matthewfallan closed 6 months ago

matthewfallan commented 7 months ago

Background

Reads with mutations separated by fewer than 3 non-mutated nucleotides are underrepresented. Presumably, the reverse transcriptase cannot read through two closely spaced chemical modifications, causing such reads to drop out, so it is rare for an observed read to contain two close mutations. Consequently, the chemical reactivities computed from the observed reads are inherently biased relative to the true, underlying chemical reactivities of all reads, including those with two or more close chemical modifications.
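
For illustration, here is a minimal sketch of what counts as two mutations that are "too close", assuming a boolean per-position mutation vector and the 3-nt minimum gap described above (the helper function is hypothetical, not part of SEISMIC-RNA's API):

```python
import numpy as np

def has_close_mutations(muts: np.ndarray, min_gap: int = 3) -> bool:
    """Return True if any two mutated positions are separated by fewer
    than `min_gap` non-mutated nucleotides (illustrative helper only)."""
    positions = np.flatnonzero(muts)
    if positions.size < 2:
        return False
    # Consecutive mutations at positions i < j have (j - i - 1) nucleotides between them.
    return bool(np.any(np.diff(positions) - 1 < min_gap))

# Example: mutations at positions 4 and 6 have only 1 nucleotide between them,
# so a read like this tends to drop out during reverse transcription.
read = np.zeros(10, dtype=bool)
read[[4, 6]] = True
print(has_close_mutations(read))  # True
```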

SEISMIC-RNA can already infer the true chemical reactivities from the observed reads. However, the correction assumes that all reads span the full section. While this assumption is generally valid for amplicon-based samples, applying it to randomly fragmented samples requires discarding any reads that do not fully overlap the section, which reduces the sequencing depth. This requirement makes it impossible to correct the mutation rates over sections that are longer than the read length, including sections produced by the join command.
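
As a rough illustration of the depth cost (using made-up read and section coordinates, not real data), the current requirement keeps only the reads that fully span the section:

```python
import numpy as np

# Simulate randomly fragmented reads over a 1,000-nt reference (assumed numbers).
rng = np.random.default_rng(0)
ref_len, read_len, n_reads = 1000, 150, 10_000
starts = rng.integers(1, ref_len - read_len + 2, n_reads)  # 1-based 5' ends
ends = starts + read_len - 1                               # 3' ends

sect5, sect3 = 401, 500  # a 100-nt section (assumed coordinates)
spans = (starts <= sect5) & (ends >= sect3)
print(f"Reads usable for bias correction: {spans.mean():.1%} of {n_reads}")
# A section longer than read_len would leave no usable reads at all.
```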

Proposal

The algorithm that corrects mutation rates could be modified so that the results remain valid even if reads do not span the full section. First, whereas the current algorithm calculates only the likelihood that a full-length read would not contain two mutations that are too close together, it could be retooled to calculate this likelihood for reads with every possible pair of start and end coordinates, using a very similar dynamic programming method. Second, it could introduce a new parameter for the proportion of reads aligned to each pair of start and end coordinates, which would be solved for along with the mutation rates, based on the observed distribution of start and end coordinates.
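
A minimal sketch of the first part of this idea: a dynamic programming recursion that computes, for a given pair of start and end coordinates, the probability that a read contains no two mutations separated by fewer than the minimum gap. The function name, coordinate convention, and recursion details here are illustrative assumptions, not SEISMIC-RNA's internal implementation:

```python
import numpy as np

def p_no_close_muts(mu: np.ndarray, end5: int, end3: int, min_gap: int = 3) -> float:
    """Probability that a read covering 0-based positions end5..end3 (inclusive)
    contains no two mutations separated by fewer than min_gap non-mutated
    positions, given per-position mutation rates mu (illustrative sketch)."""
    # s[a - end5] = probability of no close pair among positions a..end3;
    # the extra final entry is the base case for positions past end3.
    s = np.ones(end3 - end5 + 2)
    for a in range(end3, end5 - 1, -1):
        # Case 1: position a is not mutated; recurse on position a + 1.
        prob = (1.0 - mu[a]) * s[a - end5 + 1]
        # Case 2: position a is mutated; the next min_gap positions
        # (up to end3) must all be unmutated, then recurse past them.
        last = min(a + min_gap, end3)
        gap_clear = float(np.prod(1.0 - mu[a + 1: last + 1]))
        prob += mu[a] * gap_clear * s[last + 1 - end5]
        s[a - end5] = prob
    return s[0]

# Repeating this for every pair of start and end coordinates would build the
# table of observation probabilities needed to handle partial reads.
mu = np.full(60, 0.02)               # made-up per-position mutation rates
print(p_no_close_muts(mu, 0, 59))    # full-length read
print(p_no_close_muts(mu, 10, 40))   # read covering only part of the section
```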

The main disadvantage of the updated algorithm is its increased computational cost. The current algorithm has time and space complexities of roughly O(N) with respect to the length of the sequence, while the proposed algorithm would likely scale as O(N^3) and also carry a somewhat larger fixed cost. However, the time cost may not be prohibitive in practice: for short sections (< 300 nt), an O(N^3) cost would not be especially noticeable, and for sections longer than the read length, the algorithm would turn an impossible problem into a slow but solvable one. Further work could also improve performance, such as careful optimization of the code, sparse matrices for the start/end coordinates, and/or integration with a just-in-time compiler for Python (e.g. Numba).
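
As one example of the last point, the same recursion sketched above could be written with explicit loops and compiled with Numba's @njit decorator. This is a sketch assuming Numba is available; the function is hypothetical and only mirrors the illustrative recursion, not SEISMIC-RNA's code:

```python
import numpy as np
from numba import njit  # assumes Numba is installed

@njit(cache=True)
def p_no_close_muts_jit(mu, end5, end3, min_gap):
    # Same recursion as the sketch above, with explicit loops so that
    # Numba can compile it to machine code.
    s = np.ones(end3 - end5 + 2)
    for a in range(end3, end5 - 1, -1):
        last = min(a + min_gap, end3)
        gap_clear = 1.0
        for j in range(a + 1, last + 1):
            gap_clear *= 1.0 - mu[j]
        s[a - end5] = ((1.0 - mu[a]) * s[a - end5 + 1]
                       + mu[a] * gap_clear * s[last + 1 - end5])
    return s[0]

mu = np.full(300, 0.02)
print(p_no_close_muts_jit(mu, 0, 299, 3))
```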