Intro

Dear Rätsch Lab team,

The MMR software is very useful to our research group, and it is also an important part of our pipeline of RNA-Seq analysis. Although it performs with no issues for single-end reads, I needed to develop a workaround for a very particular case that appears in every paired-end library I have analyzed. I will try to explain it here with a minimal reproducible example which is available on the second part of this report.

First of all, let's schematize a particular case where a small RNA fragment of 21 nt (Insert1) aligns on direct repeats (purple). There are three possible alignments which should be reported by the aligner if we request all possible alignments satisfying the pair constraint: Alignment1 (red), Alignment2 (green) and Alignment3 (blue). Alignments 1 and 2 could represent the real RNA fragment, and Alignment3, although not real in this case, is a valid possibility given the aligner limitation.

part1

After running MMR enabling the "best only" option (-b), we should have only Case I or Case II (see scheme below), and in fact these are commonly reported. However, sometimes Case III is reported, i.e., MMR selects the R1 from Alignment1 and R2 from Alignment3 and report them with updated flags, but other fields as they were originally in the input SAM/BAM file:

part2

################ CORRECTLY RESOLVED ALIGNMENT ################
INSERT1 99  ref_genome  10  1   21M =   111 122 GATCCGGAGGGACGGGCCTCA   fffffffffffffffffffff   AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YS:i:0  YT:Z:CP XS:A:+  NH:i:3
INSERT1 147 ref_genome  111 1   21M =   10  -122    GATCCGGAGGGACGGGCCTCA   fffffffffffffffffffff   AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YS:i:0  YT:Z:CP XS:A:+  NH:i:3

################ INCORRECTLY RESOLVED ALIGNMENT ################
INSERT4 99  ref_genome  10  1   21M =   111 122 GATCCGGAGGGACGGGCCTCA   fffffffffffffffffffff   AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YS:i:0  YT:Z:CP XS:A:+  NH:i:3
INSERT4 147 ref_genome  10  1   21M =   10  -21 GATCCGGAGGGACGGGCCTCA   fffffffffffffffffffff   AS:i:0  ZS:i:0  XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:21 YS:i:0  YT:Z:CP XS:A:+  NH:i:3

Note that a valid pair in SAM/BAM file must consist in R1 having the field#4 (pos) matching the R2 field#8 (mate pos) and vice-versa (see first two entries above). The last two entries (INSERT4) output by MMR don't satisfy these criteria, and also have inconsistent fragment size on field#9.

This leads to strange behavior while running visualization tools like IGV, for example:

mre_igv

I don't think this issue can cause any damage, since the number of inconsistent pairs in my libraries are not significant at all (~ 100/1,500,000). However, I can't predict the behavior of downstream analysis tools and therefore would rather removing these inconsistent pairs by running the workaround script cited before. This script is serial, and running it for big sets of data can last a long time, but I still didn't invest any time to make it better.

I hope this report can help you understand the experienced issue, and I would appreciate if you can fix it, since it may not be a difficult task to adjust these field values before outputting the final BAM file. Let me know if you need anymore conceptual help or descriptions.

kind regards, Alan.

Reproducing this issue

Every file presented here are available inside this tarball.

For this minimal reproducible example we will need a hypothetical reference genome, hypothetical small reads aligning to direct repeats and the following programs:

hisat2
samtools
MMR

Reference genome is a random sequence of 500 bp (60% GC) created with http://www.faculty.ucr.edu/~mmaduro/random.htm . Direct repeats were created manually: Repeat1 from 10 to 30; seq: GATCCGGAGGGACGGGCCTCA Repeat2 from 111 to 131; seq: GATCCGGAGGGACGGGCCTCA

ref_genome.fasta:

>ref_genome
GACCCGGCCGATCCGGAGGGACGGGCCTCAAAGCCGCCTGACGACGGCTGCGGGCCCGTATCAGAATCCCCCCAGTGAGCCCCCGTGGCCGTCGGTTGAACAGCCCAGGAGATCCGGAGGGACGGGCCTCAGATTACGTCGCTTAACGGCTCCCGGGCCGGGGCGCGTTGCCTTGCAGGAATCGAGGCCGTCCGTTAATTCCTCTTGCGTTCGTACCGCGTATTTTTGTCTCTTTGCCCGCTTACCTGGACAAGGACGGCATAGCCTCTTACCGGAGCGCCTCCGTACACGGTACGACCGCGCGCCCCGTGAGACCAATGCGTATACCAGGTGTCCTGTGAGCAGCGAAGGCCCGAACGGGAGATACGCCGCCAGGAGTCGGCGTGAGTACGAGCCGTGGCGGATTTGGTCTGGCTGTGGTCTAGACATTCCAGGCGGTGCGTCTGCTCCGGCCTGCCTCTGGTGGCTCGCTAGATGGTCTAGCCGCTGGTAGACACTCC

Paired-end reads that align to these regions were created. Every base has a Phred of 38 (Phred+64 encoding). Five inserts are enough to reproduce the behavior. R2 is the reverse complement of R1 in this case.

R1.fastq

@INSERT1
GATCCGGAGGGACGGGCCTCA
+INSERT1
fffffffffffffffffffff
@INSERT2
GATCCGGAGGGACGGGCCTCA
+INSERT2
fffffffffffffffffffff
@INSERT3
GATCCGGAGGGACGGGCCTCA
+INSERT3
fffffffffffffffffffff
@INSERT4
GATCCGGAGGGACGGGCCTCA
+INSERT4
fffffffffffffffffffff
@INSERT5
GATCCGGAGGGACGGGCCTCA
+INSERT5
fffffffffffffffffffff

R2.fastq

@INSERT1
TGAGGCCCGTCCCTCCGGATC
+INSERT1
fffffffffffffffffffff
@INSERT2
TGAGGCCCGTCCCTCCGGATC
+INSERT2
fffffffffffffffffffff
@INSERT3
TGAGGCCCGTCCCTCCGGATC
+INSERT3
fffffffffffffffffffff
@INSERT4
TGAGGCCCGTCCCTCCGGATC
+INSERT4
fffffffffffffffffffff
@INSERT5
TGAGGCCCGTCCCTCCGGATC
+INSERT5
fffffffffffffffffffff

With files on the working directory, run the script called analysis.sh:

#!/bin/bash

# alorenzetti 201808

# required software

# hisat2
# samtools
# mmr

# starting analysis

# HISAT2 index build
hisat2-build ref_genome.fasta ref_genome

# HISAT2 aligment execution followed by name sorting
hisat2 -x ref_genome \
       -1 R1.fastq -2 R2.fastq \
       -k 1000 --rna-strandness FR \
       --no-mixed --no-discordant \
       --seed 0909 2> /dev/null |\
samtools sort -n > aln.bam

# mmr execution, coordinate sorting and indexing for visualization
mmr -S -b -p -o aln-mmr.bam -R 21 aln.bam
samtools sort aln-mmr.bam > aln-mmr-sorted.bam
samtools index aln-mmr-sorted.bam

One may note the pair inconsistencies inside aln-mmr.bam file or visualize the sorted version (aln-mmr-sorted.bam) on IGV.

ratschlab / mmr

Paired-end small reads aligning to direct repeats sometimes are not output properly #5

Intro

Reproducing this issue