gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
369 stars 77 forks source link

Abundance estimation after Minimap2 alignment #207

Open foolsgold opened 5 years ago

foolsgold commented 5 years ago

Hi Geo,

I have nanopore cDNA data aligned with minimap2 (with --cs --MD). I tried abundance estimation with StringTie v1.3.5 that should recognize the "CS" tag as an alternative to "XS". The spliced reads are identified but I find that abundance estimates are not calculated correctly. I have attached an example region to show my issue. There is a screenshot of Minimap2 (ONT) and Hisat2 (Illumina) reads alignment at the HAMP locus, and stringtie quantification of the Minimap2 alignment bams. The read alignments show highest abundance isoform as the transcript with 3 exons (ENST00000222304). This is also identified as the highest abundance transcript with Hisat2 alignment (stringtie counts not attached). However in the Minimap2 quantification with stringtie (t_data.ctab snippet in screenshot), the isoform with 4 exons (ENST00000598398) appears to have maximum abundance while the other isoforms have extremely low abundance. Do we need to change some parameters for abundance estimation (as well as assembly) with long spliced reads aligned with minimap2? I'd appreciate your advice. Thanks!

screen shot 2018-11-20 at 14 14 50

screen shot 2018-11-20 at 14 11 36

gpertea commented 5 years ago

At this stage we don't have much experience with using solely minimap2 alignments for abundance estimation with the current version of StringTie. We added that "cs" tag support as a simple technical convenience, only testing it with minimap2 alignments in very simple cases. Hence I cannot provide any advice here as I am as puzzled as you are at your observations here in this example, so I'd like to take a closer look if possible. I am trying to make sense of the figure you attached and admittedly I cannot even see the 4th exon on the left there (which is the 1st exon for that 4-exon isoform of course) being represented in the ONT alignments (or any alignments in that picture, actually). Also, probably unimportant but I'm not sure why there are both blue and red junction arcs there for ONT alignments, while for the Illumina ones there is only the red arc junction coloring - what does the blue coloring mean again, are those junctions found on the opposite strand? (sorry I haven't used IGV with junction display enabled). I am wondering if you can share some of the minimap2 alignments for this example region so I can take a closer look at how the 4-exon isoform ends up with such a high coverage..

foolsgold commented 5 years ago

Thanks a lot for the prompt reply!

I am trying to make sense of the figure you attached and admittedly I cannot even see the 4th exon on the left there (which is the 1st exon for that 4-exon isoform of course) being represented in the ONT alignments (or any alignments in that picture, actually).

Yes, there are a few reads that align to the 1st exon but most reads align from the 2nd exon only.

Also, probably unimportant but I'm not sure why there are both blue and red junction arcs there for ONT alignments, while for the Illumina ones there is only the red arc junction coloring - what does the blue coloring mean again, are those junctions found on the opposite strand? (sorry I haven't used IGV with junction display enabled).

The blue colouring is supposed to represent junctions on the (-) strand. However, here is just reflects the unstranded data. As the default nanopore protocol is unstranded, my single read fastq file does not contain strand information and I have reads aligning to both strands. The orientation of the read is inferred and written to the file at ts: SAM tag.

I am wondering if you can share some of the minimap2 alignments for this example region so I can take a closer look at how the 4-exon isoform ends up with such a high coverage..

I have uploaded the sam file from the region here Let me know if you are able to access it and if it is useful. Thanks a lot for your help!

gpertea commented 5 years ago

Thank you for the file, I was able to download it and I'm going to look into it. Also, thanks for making me realize I've been wrongly calling the newly supported minimap2 tag "cs" when in fact it is "ts" (and that's the tag I do use in the code anyway, only on the web page and in this topic here I have been calling it "cs" instead).

foolsgold commented 5 years ago

Thank you! Sorry if I added to the confusion regarding the tags. There is also an optional "cs" SAM/PAF tag in minimap2 output - this encodes the alignment, mismatches, INDELs, splice signal - similar to the MD tag.

gpertea commented 5 years ago

Just a note after the first test here - if I just directly assemble the bundle you gave me with stringtie, only one transcript is being reported in the output - the one with the 3 exons. So I suppose the abnormal isoform switching you observed only happened when you used the -e option with -G option pointing to some reference GFF or GTF which included both (or more) isoforms for this region. To save time, could you please share the reference isoforms you have in this region, taken from the GFF/GTF file you used for this issue? (or the whole file, compressed, or perhaps a link to it for me to download?)

foolsgold commented 5 years ago

Yes, I found this in the abundance estimation with known isoforms using -G with -e. I did not try the assembly. I have uploaded the GTF file containing just these reference isoforms here .

foolsgold commented 5 years ago

I have repeated the abundance estimation without restricting to known isoforms and find the transcript with three exons (ENST00000222304) as the most abundant transcript in all samples, both nanopore and Illumina. Moreover, the expression of other isoforms is extremely low, as observed in the actual alignment as well.

ONTONT

IlluminaIllumina

foolsgold commented 5 years ago

Hi Geo, Do you have any additional thoughts on why I see better abundance estimate without the "-e " option. From the manual I actually assumed the contrary, i.e. abundance estimation with "-e" should be more accurate when assembly is not required. Thanks!

gpertea commented 5 years ago

It's not about the abundance estimation being "better" with -e, it might be the other way around actually, due to the fact that read alignments are actually "forcefully" assigned to the given transcripts when -e option is used, instead of having those alignments flow "naturally" into StringTie's assembly engine which would generate the transcripts according to the observed evidence (alignments). In this particular it just so happens that the minimap2 alignments match almost perfectly the start of exon 2 and the end of exon 4 in the case of that longer isoform, while the boundaries of those alignments do not match as neatly in the case of the shorter, 3-exon isoform. We still think that should not cause StringTie to favor the longer isoform, considering there is zero support for the additional intron/exon there. I'll keep this issue open until we find the time to revisit that code and provide a resolution.

Meanwhile, the estimation (and structural) accuracy seems to be correct if you do not use -e, while still using -G option to "guide" the assembly according to the annotation.. Not sure if that works for you, since there will likely be "novel" transcripts that will lack the Ensembl "reference_id" tag in the StringTie output..