Open francicco opened 5 years ago
And indeed they are at least two loci
But those minus strands are wrong.
This might be related to the use of --fr
option there.. The information conveyed by that option should rather be encoded in the alignments themselves (by the proper XS tag setting) instead of "forcing" StringTie to assign a specific orientation. (I am starting to think it might not have been a good idea to have added those library "strandness" options to stringtie, as the aligner should be taking care of that in the first place).
How were the read alignments generated ? Are they oriented properly ? If they are, and the spliced alignments have the XS tag assigned properly, have you tried not using the --fr option and see if the strand inversion is still there?
Thank you @gpertea, I'm going to try now. F
Didn't improved, actually made it worst, stringtie did not create junctions!
The alignment was done using STAR
STAR --genomeDir $GENOMEDIR/$SPECIES.STAR.Index $limitBAMsortRAM --twopassMode Basic \
--readFilesIn $RNAR1.fastq $RNAR2.fastq --outFilterType BySJout \
--outSAMattributes All --outSAMtype BAM SortedByCoordinate \
--runThreadN 16 --alignEndsType Local --outStd Log \
--outFileNamePrefix $GENOME.RNAonAnn. --sjdbGTFfile $ANNOTATION.hints.gtf
F
I see -- it looks like the XS tag is missing for the read alignments, as all the spliced alignments got discarded..
I am not familiar with STAR options but I think you did not use the proper option that would make STAR write the XS tags, which are required if you planned to use RNA-Seq transcripts assemblers like Cufflinks or StringTie with those alignments. A quick google search suggests the option --outSAMstrandField intronMotif
should be used to get those XS tags added, not sure how up to date that answer was.. Please check with the STAR documentation about that.
Ok, thanks. I'll give it a look! F
It's still puzzling that the --fr
option did not seem to work properly there. I'd be curious if --rf
would've done something else there. I guess it is possible that the way we implemented these options does not quite work as it should in some cases -- we did not get to test these options properly as we didn't have a lot of this kind of data and when we did we relied on the aligner (HISAT2) to assign the XS tag accordingly.
Should I used that mapper? F
On Fri, 2 Nov 2018, 14:24 Geo Pertea <notifications@github.com wrote:
It's still puzzling that the --fr option did not seem to work properly there. I'd be curious if --rf would've done something else there. I guess it is possible that the way we implemented these options does not quite work as it should in some cases -- we did not get to test these options properly as we didn't have a lot of this kind of data and when we did we relied on the aligner (HISAT2) to assign the XS tag accordingly.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gpertea/stringtie/issues/204#issuecomment-435396629, or mute the thread https://github.com/notifications/unsubscribe-auth/AIlvFoOWlNnAqSHSDoOkZnGhCYCgOti-ks5urFWhgaJpZM4YLiFL .
Is this a trick question? :D Well, the StringTie protocol (if you check the papers or the web page) was based specifically on HISAT2 alignments. So our very understandably biased answer is "yes", of course.. However you should still be able to use other aligners like STAR if you prefer, just make sure you provide the right options such that the generated alignments are suitable for downstream transcript assembly. STAR used to mention Cufflinks in its documentation for such specific options, and StringTie uses the same input as Cufflinks -- so keep this in mind when reading the STAR documentation.
HAHAHAHAH Ok, I'll check HISAT2 too, let's see which perform better! ;) F
If only I could download it. It seems the server is down!
Ah, our annoying FTP server is acting up again.. I switched those downloads to http, please try reloading the hisat2 web page and click the donwload links again.. (only software, not the data files)
Ok, works
Keep in mind a few notes when using HISAT2 (and use the example StringTie2 protocol for guidance, alongside the HISAT2 manual).
--sjdbGTFfile
option), remember to build the genome index using hisat2-build with the --ss
and --exon
options as documented in the HISAT2 manual (unfortunately you'll have to use the utility scripts included with HISAT2 to extract those input files separately from your $ANNOTATION.hints.gtf file; I know, it's unfortunate that hisat2-build is not able to take a GTF file directly for this..)--rna-strandness
option in the manual). However if that strandness info is not very reliable in your sample (?) sometimes it is safer to just omit it, as the spliced alignments found by HISAT2 should automatically point to the correct orientation anyway..--dta
option (or, even more conservative, --dta-cufflinks
) when you run HISAT2 on your readsOk! Thank you very much! Much appreciated! F
Ok, I fixed the STAR run, making it compatible and I also have the alignment with HISAT2. I run Stringtie for both alignments a on/off --fr
, for a total of 4 annotation. They are slightly different.
Do you think would make sense to merge them?
I have two more questions
stringtie merge
to include them in the final annotation would it work? Because I would get also rid of merged loci, like the one at the beginning of this thread.stringtie merge
?I know I can visually check for both things, but I'd like to know what I should expect.
Thanks a lot! F
No, please don't merge the 4 BAM files (or their stringtie outputs), it's actually messing up the results instead of helping.. You should pick only one aligner/strategy -- you could use gffcompare to check the transcript discovery accuracy (as shown in the example here: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare). Unfortunately the best way to decide which aligning strategy works best using those gffcompare stats would be to repeat those 4 alignment runs but without the CDS-only annotation, use stringtie to assemble them separately, and after that look at the gffcompare accuracy at intron and intron-chain levels using the CDS-only annotation as reference (-r) for gffcompare. Even checking gffcompare stats with the existing 4 data sets you already got there may show some small differences (again, look only at the intron level and intron-chain level stats since you only have CDS annotation) but likely the results would be too close to call between STAR and HISAT2 if both used about the same guide splice sites etc.
As for the other questions:
But again, you shouldn't use stringtie --merge
unless you have multiple distinct samples to combine (not the same sample aligned with 4 strategies).
Hi @gpertea,
I perfectly see your point later yesterday I had the same thought, do not merge all assembly, pick one. So your idea is to align reads with both aligner without the reference annotation and afterwards check which strategy gives the best score with the reference annotation, right? Agree, makes totally sense to me!
About the second point, maybe I could just reintroduce the annotations for loci without coverage. In the figure they are those at the middle in the first annotation track.
Thanks a lot for your help. F
#= Summary for dataset: Djun.hisat2.stringtie.out.gtf :
# Query mRNAs : 16140 in 10196 loci (14151 multi-exon transcripts)
# (3289 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs : 13201 in 11916 loci (11405 multi-exon)
# Super-loci w/ reference transcripts: 8189
#--------------------| Sn | Sp | fSn | fSp
Base level: 84.6 58.8 - -
Exon level: 62.1 64.2 63.9 66.1
Intron level: 81.9 85.4 82.7 86.3
Intron chain level: 22.4 18.0 65.7 52.9
Transcript level: 0.0 0.0 0.0 0.0
Locus level: 20.4 23.8 43.7 40.2
Matching intron chains: 2552
Matching loci: 2426
Missed exons: 9576/80103 ( 12.0%)
Novel exons: 6777/77420 ( 8.8%)
Missed introns: 8046/68120 ( 11.8%)
Novel introns: 4148/65307 ( 6.4%)
Missed loci: 0/11916 ( 0.0%)
Novel loci: 1388/10196 ( 13.6%)
#= Summary for dataset: Djun.hisat2.fr.stringtie.out.gtf :
# Query mRNAs : 23928 in 10181 loci (22048 multi-exon transcripts)
# (5064 multi-transcript loci, ~2.4 transcripts per locus)
# Reference mRNAs : 13464 in 12172 loci (11594 multi-exon)
# Super-loci w/ reference transcripts: 8208
#--------------------| Sn | Sp | fSn | fSp
Base level: 84.8 53.5 - -
Exon level: 61.4 56.7 64.3 59.3
Intron level: 82.4 79.1 83.9 80.5
Intron chain level: 23.4 12.3 81.7 43.0
Transcript level: 0.0 0.0 0.0 0.0
Locus level: 21.1 25.2 45.0 42.0
Matching intron chains: 2714
Matching loci: 2571
Missed exons: 9443/80853 ( 11.7%)
Novel exons: 9114/87642 ( 10.4%)
Missed introns: 7710/68611 ( 11.2%)
Novel introns: 5339/71470 ( 7.5%)
Missed loci: 0/12172 ( 0.0%)
Novel loci: 1362/10181 ( 13.4%)
#= Summary for dataset: Djun.STAR.stringtie.out.gtf :
# Query mRNAs : 16342 in 10598 loci (14061 multi-exon transcripts)
# (3228 multi-transcript loci, ~1.5 transcripts per locus)
# Reference mRNAs : 12375 in 11113 loci (10967 multi-exon)
# Super-loci w/ reference transcripts: 8420
#--------------------| Sn | Sp | fSn | fSp
Base level: 87.6 57.5 - -
Exon level: 62.6 64.3 65.0 66.8
Intron level: 82.2 87.2 83.0 88.0
Intron chain level: 23.5 18.3 64.2 50.1
Transcript level: 0.0 0.0 0.0 0.0
Locus level: 22.2 23.2 44.1 37.7
Matching intron chains: 2579
Matching loci: 2467
Missed exons: 8559/78791 ( 10.9%)
Novel exons: 6103/76698 ( 8.0%)
Missed introns: 9071/67619 ( 13.4%)
Novel introns: 3635/63746 ( 5.7%)
Missed loci: 0/11113 ( 0.0%)
Novel loci: 1564/10598 ( 14.8%)
#= Summary for dataset: Djun.STAR.fr.stringtie.out.gtf :
# Query mRNAs : 24198 in 10602 loci (22048 multi-exon transcripts)
# (5119 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs : 12563 in 11295 loci (11129 multi-exon)
# Super-loci w/ reference transcripts: 8477
#--------------------| Sn | Sp | fSn | fSp
Base level: 88.3 52.3 - -
Exon level: 62.2 56.5 66.3 60.2
Intron level: 83.1 81.3 84.4 82.5
Intron chain level: 24.9 12.6 79.4 40.1
Transcript level: 0.0 0.0 0.0 0.0
Locus level: 23.3 24.8 46.1 39.8
Matching intron chains: 2774
Matching loci: 2633
Missed exons: 8145/79448 ( 10.3%)
Novel exons: 8063/87460 ( 9.2%)
Missed introns: 8601/68092 ( 12.6%)
Novel introns: 4658/69632 ( 6.7%)
Missed loci: 0/11295 ( 0.0%)
Novel loci: 1521/10602 ( 14.3%)
What would you pick? STAR with Stringtie -fr
(Djun.STAR.fr.stringtie.out.gtf
) seems to do a pretty good job.
F
Yes, tough call.. that STAR.fr run does indeed have the best sensitivity at intron/chain level, though precision (Sp) is not that great.. Quite strange that some of the Sp values are so much better when no strand constraints are applied.
Maybe I can also explore hisat2.stringtie
, the one with the higher Sp and see what gives me the more complete transcriptome?
F
Ok, I fixed the STAR run, making it compatible and I also have the alignment with HISAT2. I run Stringtie for both alignments a on/off
--fr
, for a total of 4 annotation. They are slightly different.Do you think would make sense to merge them?
I have two more questions
- it looks like some original annotation are discarded (see the figure), If I use
stringtie merge
to include them in the final annotation would it work? Because I would get also rid of merged loci, like the one at the beginning of this thread.
- And, as stringtie seems to extend the annotation with UTRs, the original CDS only annotation will be included or not by
stringtie merge
?I know I can visually check for both things, but I'd like to know what I should expect.
Thanks a lot! F
hi @francicco Could you please tell me how you fixing the STAR run?
Try adding --outSAMattributes All
Let me know
F
Thanks a lot~~~ @francicco
But, it seems that the XS attributes is not included in the --outSAMattributes All
option. Have you add the --outSAMstrandField intornMotif
for stranded RNA-Seq? @francicco
Oh yes! Sorry! F
Hello! I was watching this discussion and I find very useful because I was dealying with the same problem. I have a question for a different type of analysis @gpertea In my case I analyzed 9 Bioprojects (around 84 samples) with the same workflow (trimming/hisat2/stringtie/stringtie-merge/gffcompare) and I obtained the following parameters:
Summary for dataset: Stringtie-merge_transcriptome Query mRNAs : 154170 in 46165 loci (143808 multi-exon transcripts) (24576 multi-transcript loci, ~3.3 transcripts per locus) Reference mRNAs : 33874 in 24531 loci (29109 multi-exon) Super-loci w/ reference transcripts: 22065
Sensitivity | Precision | |
---|---|---|
Base level: | 100.0 | 39.7 |
Exon level: | 94.8 | 32.0 |
Intron level: | 100.0 | 35.3 |
Intron chain level: | 100.0 | 20.2 |
Transcript level: | 99.4 | 21.8 |
Locus level: | 99.1 | 47.5 |
Matching intron chains: 29109
Matching transcripts: 33657
Matching loci: 24317
Missed exons: 0/138290 ( 0.0%)
Novel exons: 233744/447225 ( 52.3%)
Missed introns: 1/109111 ( 0.0%)
Novel introns: 169594/309266 ( 54.8%)
Missed loci: 0/24531 ( 0.0%)
Novel loci: 24100/46165 ( 52.2%)
Total union super-loci across all input datasets: 46165 154170 out of 154170 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)
I should point out that that to obtain this Merge Transcriptome, what I did was to merge the GTFs obtained by StringTie for strand and unstrand libraries, which I analyzed individually (each bioproject) with the corresponding parameter (--fr or unstranded). Given an example:
The first question that I have is if it is correct to merge the gtf files obtained by Stringtie proccessing unstranded and stranded samples.
When I compare with the reference I have a similar problem like francicco showed before:
The transcript are in the correct position but sometimes they are too long and occupy the place of another transcript.
Thanks in advance, I hope you understand me.
Rodrigo
Hi,
I'm using stringtie to assemble a transcriptome using a reference annotation made with a predict annotation containing only CDS regions.
This is the command line I'm using:
Everything seems to work, but for some loci transcripts are not consistent with the guide and strand
In this figure you can see the first track as the Augustus Prediction Annotation (APA), a long + loci. the second track is the stringtie annotation (SA) using the APA as guide (-G). The third is the previous SA + predicted CDS (SAC). As you can see there are:
Make somehow sense that based on evidence stringtie split the single long locus into multiple loci, but the minus strand transcripts seems out of place; that happens a lot of times.
Am I doing something wrong? Thanks for your help F