gpertea / stringtie

Transcript assembly and quantification for RNA-Seq
MIT License
377 stars 78 forks source link

stringtie does not assign the strand for sam files with setted XS tag #262

Open camillaugolini-iit opened 4 years ago

camillaugolini-iit commented 4 years ago

Hi, I have some samfiles which are output of minimap2 used for long nanopore reads (RNA seq). Because minimap does not have the tag XS, I added it manually to each read. However stringtie continue to give some reads with strand ' . ' even if the flag in the sam file is present both for "+" reads and "-" ones. What parameter does stringtie use to set the strand?

gpertea commented 4 years ago

StringTie should have recognized the tags produced by minimap2 for spliced alignments -- there should not be a need to "manually" add the XS tags (there was a release note for a previous version that we added that feature). Anyway, about the transcript assemblies that you see in the output having the undetermined strand '.' -- are they single exon transcripts ? I am afraid StringTie cannot determine the strand for single-exon transcript assemblies (unless they happen to overlap some reference transcripts given with the -G option), and it won't even look for the XS tag for non-intron read alignments. The --rf or --fr options can be used to enforce a strand for single-exon assemblies but that should only be used for stranded libraries.

gpertea commented 4 years ago

Correction: I mistakenly stated above that StringTie:

won't even look for the XS tag for non-intron read alignments.

.. but that is not currently the case, StringTie should consider that tag for every single alignment now.

Maybe you can send me a small example bundle where you added the XS tag to all read alignments and yet StringTie outputs transcript assemblies with unknown strand?

camillaugolini-iit commented 4 years ago

Thank you, here's the head of the alignment file (every read in the sam file has the flag for reverse or sense alignment) :

here is an example of the alignment with no strand: 1 StringTie transcript 1059710 1059913 1000 . . gene_id "STRG.16"; transcript_id "STRG.16.1"; cov "1.720588"; FPKM "1.079395"; TPM "1.443259"; 1 StringTie exon 1059710 1059913 1000 . . gene_id "STRG.16"; transcript_id "STRG.16.1"; exon_number "1"; cov "1.720588"; if you need the entire file i can send it. Thanks again
gpertea commented 4 years ago

Posting SAM records like this directly as part of the text is not very functional. Based on your example assembly which seems to be at location 1:1059710-1059913, can you attach a small SAM (or better yet, a BAM file) with just the long reads from that region?

camillaugolini-iit commented 4 years ago

Here is the selection of reads from chromosome 1 that have the end coordinate < 1059913.

Hope it is useful.


Da: Geo Pertea notifications@github.com Inviato: lunedì 24 febbraio 2020 23:45:37 A: gpertea/stringtie Cc: Camilla Ugolini; Author Oggetto: Re: [gpertea/stringtie] stringtie does not assign the strand for sam files with setted XS tag (#262)

Posting SAM records like this directly as part of the text is not very functional. Based on your example assembly which seems to be at location 1:1059710-1059913, can you attach a small SAM (or better yet, a BAM file) with just the long reads from that region?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/gpertea/stringtie/issues/262?email_source=notifications&email_token=ANYVRMWQATC3DQBHUAGBILDREREZDA5CNFSM4KZEOPVKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEMZ2NMI#issuecomment-590587569, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ANYVRMSEW3IC526WVYTB63LREREZDANCNFSM4KZEOPVA.

gpertea commented 4 years ago

It seems the attachment is missing here (didn't work?), if the file is too large to attach here in github could you perhaps upload it to a file sharing service (Google Drive or even Firefox Send if it's below 2.5GB). Compressing it (if it's SAM and not BAM) might help to keep the file size small. Thanks!

camillaugolini-iit commented 4 years ago

selected.bam.zip

Does this work?

gpertea commented 4 years ago

Yes it worked, thanks (but github had a big hiccup just minutes ago) I just ran stringtie on that selected.bam file (with 1999 alignments which do not have XS tags, but have minimap's own ts tags which StringTie recognizes):

../stringtie -v -L -o out.gtf selected.bam

The resulting GTF has 20 transcripts and they all have the transcription strand assigned (7 on reverse strand, 13 on the forward strand). So I am not sure what is happening on your side, why you are not getting a strand assigned there for the assembled transcripts? Perhaps you assigned the XS strand incorrectly ? (I guess you know that the ts to XS conversion is not straight forward -- i.e. you cannot just copy the + or - character, because minimap2 uses a different assignment there for alignments on the reverse genomic strand).

camillaugolini-iit commented 4 years ago

Sorry for the inconvenience, I'm sending the right file to your email.

gpertea commented 4 years ago

I got the full data and I see that indeed there are cases (72 out of 2262 transcripts) where the strand information is "lost" as you described. All these generated assemblies with undetermined strand are single-exon -- even though the input consists of long reads aligned with at least two "exons" and thus they all have a transcription strand assigned by minimap2 (the ts tag is there).

I looked at a few of these cases and it seems that StringTie refuses to consider some of those spliced alignments as valid in cases where the splice sites are very inconsistent (i.e. farther than a few bases apart) between multiple spliced alignments overlapping across that region. As you know long reads are quite noisy so when StringTie encounters a low-coverage bundle where the splice sites between two exons are inconsistent between the read alignments in that region it might split/trim the assembly there, invalidate the splice sites and as a result the output may show a single (terminal) exon transfrag with no strand information (because the strand was considered to have been assigned by the splice sites, but if those are found to not be valid.. the strand is invalidated as well).

It's not easy to fix that outcome when alignments are "messy" like that. The only way to correct this output is to provided better, more consistent spliced alignments (if those are possible, of course). Given the error rate, if that intron were real, the corresponding splice sites in each alignment should not be that far apart from each other (they were tens of bases apart in the examples that I reviewed, essentially invalidating each-other).