Open alanlorenzetti opened 6 years ago
Dear Alan,
Apologies for the late reply. You hit us during vacation season. Also thank you for the detailed report and the example case. We will have a look at the matter and evaluate how easy a fix would be.
Thanks and Cheers, Andre
Intro
Dear Rätsch Lab team,
The MMR software is very useful to our research group and is an important part of our RNA-Seq analysis pipeline. Although it performs with no issues for single-end reads, I needed to develop a workaround for a very particular case that appears in every paired-end library I have analyzed. I will try to explain it here with a minimal reproducible example, which is available in the second part of this report.
First of all, let's schematize a particular case where a small RNA fragment of 21 nt (Insert1) aligns on direct repeats (purple). There are three possible alignments which should be reported by the aligner if we request all possible alignments satisfying the pair constraint: Alignment1 (red), Alignment2 (green) and Alignment3 (blue). Alignments 1 and 2 could represent the real RNA fragment, and Alignment3, although not real in this case, is a valid possibility given the aligner limitation.
After running MMR with the "best only" option (-b) enabled, we should have only Case I or Case II (see scheme below), and in fact these are commonly reported. However, sometimes Case III is reported, i.e., MMR selects R1 from Alignment1 and R2 from Alignment3 and reports them with updated flags, but leaves the other fields as they were in the input SAM/BAM file:
Note that a valid pair in a SAM/BAM file requires field#4 (pos) of R1 to match field#8 (mate pos) of R2, and vice versa (see first two entries above). The last two entries (INSERT4) output by MMR do not satisfy these criteria and also have an inconsistent fragment size in field#9.
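The pairing rule described above can be sketched in a few lines of Python. This is only an illustration: the function names and the two example pairs are hypothetical (they are not taken from my BAM files), and the check assumes plain tab-separated SAM text with the standard column order.

```python
# Minimal sketch of the mate-consistency rule (field#4 vs. field#8, plus field#9).
# Function names and the example records are hypothetical.

def parse_sam_fields(line):
    """Extract the SAM columns needed for the check (columns are 1-based in the spec)."""
    f = line.rstrip("\n").split("\t")
    return {
        "qname": f[0],
        "pos": int(f[3]),    # field#4: leftmost mapping position
        "pnext": int(f[7]),  # field#8: position of the mate
        "tlen": int(f[8]),   # field#9: signed fragment size
    }

def mates_are_consistent(r1_line, r2_line):
    """True if each record points at the other and the fragment sizes mirror each other."""
    a, b = parse_sam_fields(r1_line), parse_sam_fields(r2_line)
    return (
        a["qname"] == b["qname"]
        and a["pos"] == b["pnext"]
        and b["pos"] == a["pnext"]
        and a["tlen"] == -b["tlen"]
    )

def make_record(qname, flag, pos, pnext, tlen):
    """Build a hypothetical 11-column SAM line for a 21 nt read."""
    cols = [qname, str(flag), "ref", str(pos), "255", "21M", "=",
            str(pnext), str(tlen), "N" * 21, "f" * 21]
    return "\t".join(cols)

# Hypothetical consistent pair: both mates on the first repeat.
good_r1 = make_record("EXAMPLE_A", 99, 10, 10, 21)
good_r2 = make_record("EXAMPLE_A", 147, 10, 10, -21)

# Hypothetical inconsistent pair: R1 kept from one alignment, R2 from another,
# with the mate position left untouched.
bad_r1 = make_record("EXAMPLE_B", 99, 10, 10, 21)
bad_r2 = make_record("EXAMPLE_B", 147, 111, 111, -21)

print(mates_are_consistent(good_r1, good_r2))
print(mates_are_consistent(bad_r1, bad_r2))
```

The second pair fails because R1 still claims its mate is at position 10 while R2 actually sits at position 111.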
This leads to strange behavior while running visualization tools like IGV, for example:
I don't think this issue can cause any damage, since the number of inconsistent pairs in my libraries is not significant at all (~ 100/1,500,000). However, I can't predict the behavior of downstream analysis tools and would therefore rather remove these inconsistent pairs by running the workaround script mentioned before. This script is serial, and running it on big data sets can take a long time, but I have not yet invested any time in making it better.
I hope this report helps you understand the issue, and I would appreciate it if you could fix it, since it may not be a difficult task to adjust these field values before outputting the final BAM file. Let me know if you need any more conceptual help or descriptions.
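For reference, a single-pass filter along these lines might look like the sketch below. It is not my actual workaround script. It assumes SAM text in which both mates of a pair are adjacent (name-grouped) and each read name appears exactly twice, as expected after running with -b. Groups of any other size are passed through untouched, and the function names are hypothetical.

```python
# Sketch of a streaming filter that drops pairs whose mate fields disagree.
# Assumes name-grouped SAM text; names and structure are hypothetical.

def _qname(line):
    return line.split("\t", 1)[0]

def _pair_is_consistent(r1, r2):
    a = r1.rstrip("\n").split("\t")
    b = r2.rstrip("\n").split("\t")
    # field#4 of one mate must equal field#8 of the other; field#9 must mirror.
    return a[3] == b[7] and b[3] == a[7] and int(a[8]) == -int(b[8])

def _emit(group):
    # Only judge groups of exactly two records; leave everything else alone.
    if len(group) == 2 and not _pair_is_consistent(group[0], group[1]):
        return []
    return group

def drop_inconsistent_pairs(lines):
    """Yield header lines and every record except those in inconsistent pairs."""
    group = []
    for line in lines:
        if line.startswith("@"):  # SAM header lines pass straight through
            yield line
            continue
        if group and _qname(line) != _qname(group[0]):
            yield from _emit(group)
            group = []
        group.append(line)
    yield from _emit(group)
```

Because only one read-name group is held in memory at a time, the cost grows linearly with the number of records. With samtools installed, the function could be fed from `samtools view -h` and its output piped back into `samtools view -b`.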
kind regards, Alan.
Reproducing this issue
All files presented here are available inside this tarball.
For this minimal reproducible example we will need a hypothetical reference genome, hypothetical small reads aligning to direct repeats and the following programs:
The reference genome is a random sequence of 500 bp (60% GC) created with http://www.faculty.ucr.edu/~mmaduro/random.htm . Direct repeats were created manually:
Repeat1 from 10 to 30; seq: GATCCGGAGGGACGGGCCTCA
Repeat2 from 111 to 131; seq: GATCCGGAGGGACGGGCCTCA
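As a rough illustration of this layout, the sketch below builds a stand-in 500 bp sequence, plants the repeat at the two stated 1-based coordinates and then locates every occurrence. The random background is generated locally with a fixed seed, so it is not the sequence in ref_genome.fasta. The helper names are hypothetical.

```python
# Sketch: plant the 21 nt repeat at 1-based positions 10-30 and 111-131 of a
# stand-in random sequence, then report where it occurs. Not the real reference.
import random

REPEAT = "GATCCGGAGGGACGGGCCTCA"
STARTS = (10, 111)  # 1-based start coordinates from the description

def build_stand_in_reference(length=500, gc=0.6, seed=0):
    rng = random.Random(seed)
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    seq = rng.choices("ACGT", weights=weights, k=length)
    for start in STARTS:
        seq[start - 1:start - 1 + len(REPEAT)] = REPEAT  # convert to 0-based slice
    return "".join(seq)

def find_occurrences(seq, motif):
    """Return the 1-based start of every (possibly overlapping) occurrence."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i + 1)
        i = seq.find(motif, i + 1)
    return hits

reference = build_stand_in_reference()
print(len(REPEAT), find_occurrences(reference, REPEAT))
```

A start of 10 with a 21 nt repeat ends at position 30, and a start of 111 ends at 131, which matches the coordinates given above.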
ref_genome.fasta:
Paired-end reads that align to these regions were created. Every base has a Phred of 38 (Phred+64 encoding). Five inserts are enough to reproduce the behavior. R2 is the reverse complement of R1 in this case.
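To show how such a pair relates, here is a small sketch that derives R2 from R1 by reverse complement and writes both as FASTQ records with a constant Phred 38 quality in Phred+64 encoding. The read name is hypothetical, and using the repeat itself as the R1 sequence is an assumption made for illustration. The actual reads are in R1.fastq and R2.fastq.

```python
# Sketch: build one R1/R2 FASTQ pair where R2 is the reverse complement of R1.
# The read name and the choice of R1 sequence are assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def fastq_record(name, seq, phred=38, offset=64):
    quality = chr(phred + offset) * len(seq)  # one quality character per base
    return f"@{name}\n{seq}\n+\n{quality}\n"

r1_seq = "GATCCGGAGGGACGGGCCTCA"  # assumed: the 21 nt repeat itself
r2_seq = reverse_complement(r1_seq)

r1_record = fastq_record("EXAMPLE_INSERT/1", r1_seq)
r2_record = fastq_record("EXAMPLE_INSERT/2", r2_seq)

print(r1_record, end="")
print(r2_record, end="")
```

With an offset of 64, a Phred score of 38 maps to ASCII code 102, i.e. the character `f`.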
R1.fastq
R2.fastq
With the files in the working directory, run the script called analysis.sh:
One may note the pair inconsistencies in the aln-mmr.bam file or visualize the sorted version (aln-mmr-sorted.bam) in IGV.