lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.81k stars 412 forks source link

strand-specific mapping is not filtering correctly #95

Closed DepledgeLab closed 6 years ago

DepledgeLab commented 6 years ago

I'm trying to map some PacBio IsoSeq data (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE97785) to compare versus my own data generated by Illumina and other platforms.

minimap2-2.7_x64-linux/minimap2 -ax splice -uf --secondary=no ../../../reference_sequences/HSV1-st17.fasta pacbio.fastq > hsv1.pacbio.minimap2.sam

The issue i'm having is that the resulting mapping are not reflective of read data after I sort for 0 and 16 flags which otherwise works on Illumina and other data types.

plot

The attached plot shows nanopore, pacbio and illumina data and shows the issue clearly.

I'd appreciate any thoughts you have...

DepledgeLab commented 6 years ago

As an update, I tried mapping the reads with GMAP and filtering on their AS:A:+ / AS:A:- flags and ended up with the same plot. I think there is something amiss with the actual PacBio data and I will contact the authors to find out.

lh3 commented 6 years ago

There are three types of strands. I am not sure what you are looking for.

  1. The alignment strand relative to the reference genome (SAM flag field)
  2. The transcript strand relative to the reference (XS tag in gmap output)
  3. The alignment strand relative to the transcript (ts tag in minimap2 output)

When you use gmap and force the alignment to the sense strand, 1 and 2 are always identical. When you apply -uf to minimap2, ts is always +.

lh3 commented 6 years ago

BTW, see also #88

DepledgeLab commented 6 years ago

What I am trying to do is determine from which strand the original RNA originates (i.e. stranded RNA-Seq). This is simple with Illumina and nanopore as I just filter on the SAM flag field (shown in the upper and lower panels of the figure above). However, filtering on SAM flag for PacBio IsoSeq reads (aligned with minimap2) does not result in the same figure (middle panel), nor does filtering for XS tag when using gmap.

lh3 commented 6 years ago

Not all iso-seq reads come from the transcript strand. For non-full-length reads, you should not use -uf. After alignment, you need to jointly look at the ts tag and the SAM flag field: if you see ts:A:+, the transcript strand is the alignment strand from the SAM flag field; if ts:A:-, the transcript strand is the opposite of the alignment strand. Similarly with gmap, you should not apply option -z.

DepledgeLab commented 6 years ago

I see what you mean but the issue i'm having is that the ts tag does not appear in the SAM output if I remove the -uf option

e.g. if i run the command as

minimap2-2.7_x64-linux/minimap2 -ax splice --secondary=no reference_sequence.fasta pacbio.fastq > output.sam

the only tags in the output file are NM:i, ms:i, AS:i, nn:i, tp:A:P, cm:i, s1:i, s2:i:0, dv:f

On Thu, Jan 18, 2018 at 1:49 PM, Heng Li notifications@github.com wrote:

Not all iso-seq reads come from the transcript strand. For non-full-length reads, you should not use -uf. After alignment, you need to jointly look at the ts tag and the SAM flag field: if you see ts:A:+, the transcript strand is the alignment strand from the SAM flag field; if ts:A:-, the transcript strand is the opposite of the alignment strand. Similarly with gmap, you should not apply option -z.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/lh3/minimap2/issues/95#issuecomment-358744045, or mute the thread https://github.com/notifications/unsubscribe-auth/Ae_cisbGtdmnhwmsww4pSKcgHrAcbp1Uks5tL5IpgaJpZM4RiHFz .

lh3 commented 6 years ago

You only see ts when the read is spliced and there is at least one canonical junction (i.e. GT-AG); otherwise it is not possible to infer the strand. Gmap and other RNA-seq mappers work the same way.