gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License

StringTie dealing with the read alignments which do not have the XS tag #205

Open lauraht opened 5 years ago

lauraht commented 5 years ago

Hi Geo,

In my BAM file, there are a number of spliced read alignments which do not have the XS tag. My understanding is that StringTie will ignore those spliced read alignments which do not have the XS tag. I noticed that StringTie produced many single-exon transcripts in the output GTF file. So I was wondering whether those extra single-exon transcripts output by StringTie were caused by the read alignments which do not have the XS tag. How does StringTie deal with read alignments which do not have the XS tag: does it simply not take those read alignments as input, or does it turn them into single-exon transcripts in some way? I used StringTie v1.3.2d.

I’d appreciate your advice.

Thank you very much!

gpertea commented 5 years ago

Unfortunately those spliced alignments which lack the XS tag are currently just discarded (skipped over) by StringTie, so they will not contribute to the assembly and abundance estimation at all. Indeed, seeing a bunch of single-exon transcripts in those locations, despite having spliced alignments there, may be an indication that only single-exon alignments were taken into account in that region (surely you have those too, right?). Also, I'm not sure what annotation you were using there (was it GENCODE?), but v1.3.2 might still have been affected by the issue of converting some gene or transcript annotations into large single-exon transcripts spanning the whole gene/transcript. (I know that this problem affected some older StringTie versions, but I seem to recall that I kept having trouble with it even up to v1.3.4.) Just to be on the safe side there, could you run 1.3.4d or later?
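To gauge how many alignments are affected, one could count the spliced alignments that carry no XS tag. The following is only a minimal sketch, not part of StringTie: it assumes plain-text SAM lines (for example from `samtools view`), treats any alignment whose CIGAR contains an `N` operation as spliced, and looks for an `XS:A:` optional field. The function name `count_spliced_without_xs` is made up for illustration.

```python
def count_spliced_without_xs(sam_lines):
    """Return (spliced_total, spliced_missing_xs) for an iterable of SAM text lines.

    Assumption: an alignment is "spliced" if its CIGAR string contains an N
    (skipped region) operation; the strand tag is looked up as XS:A:<+|->.
    """
    spliced_total = 0
    missing_xs = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # read is unmapped
        if "N" not in fields[5]:
            continue  # no intron gap in the CIGAR, so not spliced
        spliced_total += 1
        if not any(tag.startswith("XS:A:") for tag in fields[11:]):
            missing_xs += 1
    return spliced_total, missing_xs


# Tiny hand-made example (hypothetical reads, not real data).
example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t50M1000N50M\t*\t0\t0\t*\t*\tXS:A:+\tNH:i:1",
    "r2\t0\tchr1\t200\t60\t40M500N60M\t*\t0\t0\t*\t*\tNH:i:1",
    "r3\t16\tchr1\t300\t60\t100M\t*\t0\t0\t*\t*\tNH:i:1",
]
print(count_spliced_without_xs(example))  # (2, 1)
```

On a real file the same function could be fed the text output of `samtools view`, one record per line. A large second number relative to the first would be consistent with the situation described above.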

lauraht commented 5 years ago

Hi Geo,

Thank you so much for your advice!

So you mean seeing so many single-exon transcripts is not because of those read alignments which do not have the XS tag, right? In other words, if those alignments had the XS tag, the single-exon alignments in those regions would still be reported as single-exon transcripts, is that right?

I am using Ensembl. I was just wondering what might have caused some genes/transcripts to turn into large single-exon transcripts spanning the whole gene/transcript. Are those large single-exon transcripts usually unstranded? I found that many single-exon transcripts in the GTF output seem to have no strand info.

Thanks a lot!

gpertea commented 5 years ago

I didn't quite mean that. The lack of the XS tag could still be a problem: as I mentioned, spliced alignments are ignored if they lack the XS tag, so you're left with single-exon alignments, and that's why you might only see single-exon alignments with undefined strand (which StringTie does not ignore).

> In other words, if those alignments had the XS tag, the single-exon alignments in those regions would still be reported as single-exon transcripts, is that right?

Only if they do not overlap the spliced alignments, yes -- in that case they'll end up assembled as single exon transcripts (with undetermined strand), as they are now.

> I was just wondering what might have caused some genes/transcripts to turn into large single-exon transcripts spanning the whole gene/transcript. Are those large single-exon transcripts usually unstranded?

Reference annotation usually has a strand assigned even to single-exon transcripts (the transcription strand should be known for a proper annotation). The single-exon transfrags generated by StringTie have no strand info unless they overlap single-exon reference transcripts (which have a strand assignment), in which case they'll get the same strand assignment.
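To see how many single-exon transcripts in an output GTF are unstranded, a small tally like the one below may help. It is a rough sketch rather than anything StringTie provides: it assumes standard 9-column GTF lines, counts `exon` features per `transcript_id`, and reads the strand from column 7, where `.` means no strand was assigned. The function name `single_exon_strand_counts` is hypothetical.

```python
import re
from collections import Counter

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def single_exon_strand_counts(gtf_lines):
    """Count single-exon transcripts per strand ('+', '-', or '.') in GTF text lines."""
    exon_counts = Counter()  # transcript_id -> number of exon records
    strands = {}             # transcript_id -> strand column of its exons
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records are counted
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        tid = match.group(1)
        exon_counts[tid] += 1
        strands[tid] = fields[6]
    return Counter(strands[tid] for tid, n in exon_counts.items() if n == 1)


def _exon(tid, start, end, strand):
    """Build one made-up exon line for the example below."""
    attrs = 'gene_id "G"; transcript_id "%s";' % tid
    return "\t".join(["chr1", "StringTie", "exon", str(start), str(end),
                      "1000", strand, ".", attrs])


example_gtf = [
    _exon("T1", 100, 200, "+"), _exon("T1", 500, 600, "+"),  # two exons
    _exon("T2", 900, 1200, "."),                             # single exon, unstranded
    _exon("T3", 2000, 2400, "-"),                            # single exon, stranded
]
counts = single_exon_strand_counts(example_gtf)
print(counts["."], counts["-"], counts["+"])  # 1 1 0
```

In a run without `-G`, the explanation above suggests that the single-exon transcripts would be expected to fall under `.`.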

Just to eliminate the possibility of that GTF parsing bug in StringTie, please make sure you run v1.3.4 or newer.

lauraht commented 5 years ago

I see. So if those spliced alignments had the XS tag, the single-exon alignments that overlap them would have been represented by the longer spliced transcripts during the assembly process and thus would not have appeared as single-exon transcripts by themselves in the GTF output. In other words, since the spliced alignments which do not have the XS tag are ignored, the single-exon alignments that overlap them just become single-exon transcripts by themselves, which is why the GTF output has so many single-exon transcripts. This makes sense.

Also, I just realized that when you talked about the GTF parsing bug or the issue of converting some gene or transcript annotations into large single-exon transcripts spanning the whole gene/transcript, you actually referred to the case of using the reference annotation for guiding the assembly process (the -G option), is that right? Actually I am not using the -G option, so I guess we could eliminate the possibility of that GTF parsing bug (if I understand this correctly).

Thank you so much for the help!

gpertea commented 5 years ago

Yes, that understanding of the effect of the missing XS tag is correct: only single-exon alignments are then seen by StringTie, so the output will consist only of single-exon assembled transfrags (assuming -G was not used). And yes, I thought you were using the -G option. That's why I mentioned that old bug in the reference annotation parsing, which could have been another, very different, possible cause for getting (large) single-exon transcripts in the output.

QuanLG commented 4 years ago

Hi gpertea. HISAT2 can add the XS tag to all read alignments. I have stranded library (fr-firststrand) data, and I used STAR to map it to the genome with the option --outSAMstrandField intronMotif, which adds the XS tag to spliced alignments only; the other read alignments do not have the XS tag. I then used StringTie for transcript assembly and quantification, so I want to know: is there a problem with that? And does the XS tag play a decisive role in transcript assembly and quantification?

ZHIDIHUAYUAN commented 3 years ago

Hi gpertea, I used StringTie (v2.1.5) to assemble the transcripts, with the RefSeq GFF file as the annotation. I got many single-exon transcripts that were unstranded, but some single-exon transcripts do have strand information. How can I deal with them?

Thank you.