Unspecific strand - Githubissues

francicco commented 5 years ago

Hi,

I'm using stringtie to assemble a transcriptome using a reference annotation made with a predict annotation containing only CDS regions.

This is the command line I'm using:

stringtie $GENOME.RNAonAnn.Aligned.sortedByCoord.out.bam -G $ANNOTATION.stringtie.gff3 --fr \
    -l $SPECIES -o $SPECIES.stringtie.out.gtf -p $THREADS -C $SPECIES.stringtie.cov \
    -A $SPECIES.gene_abund.out -B

Everything seems to work, but for some loci transcripts are not consistent with the guide and strand

In this figure you can see the first track as the Augustus Prediction Annotation (APA), a long + loci. the second track is the stringtie annotation (SA) using the APA as guide (-G). The third is the previous SA + predicted CDS (SAC). As you can see there are:

Multiple transcripts in place of a single locus
Multiple transcripts in the opposite strand (-)

Make somehow sense that based on evidence stringtie split the single long locus into multiple loci, but the minus strand transcripts seems out of place; that happens a lot of times.

Am I doing something wrong? Thanks for your help F

francicco commented 5 years ago

And indeed they are at least two loci

But those minus strands are wrong.

gpertea commented 5 years ago

This might be related to the use of --fr option there.. The information conveyed by that option should rather be encoded in the alignments themselves (by the proper XS tag setting) instead of "forcing" StringTie to assign a specific orientation. (I am starting to think it might not have been a good idea to have added those library "strandness" options to stringtie, as the aligner should be taking care of that in the first place).

How were the read alignments generated ? Are they oriented properly ? If they are, and the spliced alignments have the XS tag assigned properly, have you tried not using the --fr option and see if the strand inversion is still there?

francicco commented 5 years ago

Thank you @gpertea, I'm going to try now. F

francicco commented 5 years ago

Didn't improved, actually made it worst, stringtie did not create junctions!

The alignment was done using STAR

STAR --genomeDir $GENOMEDIR/$SPECIES.STAR.Index $limitBAMsortRAM --twopassMode Basic \
--readFilesIn $RNAR1.fastq $RNAR2.fastq --outFilterType BySJout              \
--outSAMattributes All --outSAMtype BAM SortedByCoordinate                   \
--runThreadN 16 --alignEndsType Local --outStd Log                           \
--outFileNamePrefix $GENOME.RNAonAnn. --sjdbGTFfile $ANNOTATION.hints.gtf

F

gpertea commented 5 years ago

I see -- it looks like the XS tag is missing for the read alignments, as all the spliced alignments got discarded.. I am not familiar with STAR options but I think you did not use the proper option that would make STAR write the XS tags, which are required if you planned to use RNA-Seq transcripts assemblers like Cufflinks or StringTie with those alignments. A quick google search suggests the option --outSAMstrandField intronMotif should be used to get those XS tags added, not sure how up to date that answer was.. Please check with the STAR documentation about that.

francicco commented 5 years ago

Ok, thanks. I'll give it a look! F

gpertea commented 5 years ago

It's still puzzling that the --fr option did not seem to work properly there. I'd be curious if --rf would've done something else there. I guess it is possible that the way we implemented these options does not quite work as it should in some cases -- we did not get to test these options properly as we didn't have a lot of this kind of data and when we did we relied on the aligner (HISAT2) to assign the XS tag accordingly.

francicco commented 5 years ago

Should I used that mapper? F

On Fri, 2 Nov 2018, 14:24 Geo Pertea <notifications@github.com wrote:

It's still puzzling that the --fr option did not seem to work properly there. I'd be curious if --rf would've done something else there. I guess it is possible that the way we implemented these options does not quite work as it should in some cases -- we did not get to test these options properly as we didn't have a lot of this kind of data and when we did we relied on the aligner (HISAT2) to assign the XS tag accordingly.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/gpertea/stringtie/issues/204#issuecomment-435396629, or mute the thread https://github.com/notifications/unsubscribe-auth/AIlvFoOWlNnAqSHSDoOkZnGhCYCgOti-ks5urFWhgaJpZM4YLiFL .

gpertea commented 5 years ago

Is this a trick question? :D Well, the StringTie protocol (if you check the papers or the web page) was based specifically on HISAT2 alignments. So our very understandably biased answer is "yes", of course.. However you should still be able to use other aligners like STAR if you prefer, just make sure you provide the right options such that the generated alignments are suitable for downstream transcript assembly. STAR used to mention Cufflinks in its documentation for such specific options, and StringTie uses the same input as Cufflinks -- so keep this in mind when reading the STAR documentation.

francicco commented 5 years ago

HAHAHAHAH Ok, I'll check HISAT2 too, let's see which perform better! ;) F

francicco commented 5 years ago

If only I could download it. It seems the server is down!

gpertea commented 5 years ago

Ah, our annoying FTP server is acting up again.. I switched those downloads to http, please try reloading the hisat2 web page and click the donwload links again.. (only software, not the data files)

francicco commented 5 years ago

Ok, works

gpertea commented 5 years ago

Keep in mind a few notes when using HISAT2 (and use the example StringTie2 protocol for guidance, alongside the HISAT2 manual).

if you want to use reference annotation as hints for the aligner (as you did for STAR using the --sjdbGTFfile option), remember to build the genome index using hisat2-build with the --ss and --exon options as documented in the HISAT2 manual (unfortunately you'll have to use the utility scripts included with HISAT2 to extract those input files separately from your $ANNOTATION.hints.gtf file; I know, it's unfortunate that hisat2-build is not able to take a GTF file directly for this..)
HISAT2, as opposed to STAR, does have options for strand-specific libraries (check --rna-strandness option in the manual). However if that strandness info is not very reliable in your sample (?) sometimes it is safer to just omit it, as the spliced alignments found by HISAT2 should automatically point to the correct orientation anyway..
remember to use --dta option (or, even more conservative, --dta-cufflinks) when you run HISAT2 on your reads

francicco commented 5 years ago

Ok! Thank you very much! Much appreciated! F

francicco commented 5 years ago

Ok, I fixed the STAR run, making it compatible and I also have the alignment with HISAT2. I run Stringtie for both alignments a on/off --fr, for a total of 4 annotation. They are slightly different.

Do you think would make sense to merge them?

I have two more questions

it looks like some original annotation are discarded (see the figure), If I use stringtie merge to include them in the final annotation would it work? Because I would get also rid of merged loci, like the one at the beginning of this thread.

And, as stringtie seems to extend the annotation with UTRs, the original CDS only annotation will be included or not by stringtie merge?

I know I can visually check for both things, but I'd like to know what I should expect.

Thanks a lot! F

gpertea commented 5 years ago

No, please don't merge the 4 BAM files (or their stringtie outputs), it's actually messing up the results instead of helping.. You should pick only one aligner/strategy -- you could use gffcompare to check the transcript discovery accuracy (as shown in the example here: http://ccb.jhu.edu/software/stringtie/gff.shtml#gffcompare). Unfortunately the best way to decide which aligning strategy works best using those gffcompare stats would be to repeat those 4 alignment runs but without the CDS-only annotation, use stringtie to assemble them separately, and after that look at the gffcompare accuracy at intron and intron-chain levels using the CDS-only annotation as reference (-r) for gffcompare. Even checking gffcompare stats with the existing 4 data sets you already got there may show some small differences (again, look only at the intron level and intron-chain level stats since you only have CDS annotation) but likely the results would be too close to call between STAR and HISAT2 if both used about the same guide splice sites etc.

As for the other questions:

couldn't quite see from the picture which were the discarded reference transcripts but yes, stringtie --merge should put those back in play for abundance estimation.
- yes, CDSs will be merged with the intron-compatible transcripts assembled by stringtie --merge in the process.

But again, you shouldn't use stringtie --merge unless you have multiple distinct samples to combine (not the same sample aligned with 4 strategies).

francicco commented 5 years ago

Hi @gpertea,

I perfectly see your point later yesterday I had the same thought, do not merge all assembly, pick one. So your idea is to align reads with both aligner without the reference annotation and afterwards check which strategy gives the best score with the reference annotation, right? Agree, makes totally sense to me!

About the second point, maybe I could just reintroduce the annotations for loci without coverage. In the figure they are those at the middle in the first annotation track.

Thanks a lot for your help. F

francicco commented 5 years ago

#= Summary for dataset: Djun.hisat2.stringtie.out.gtf :
#     Query mRNAs :   16140 in   10196 loci  (14151 multi-exon transcripts)
#            (3289 multi-transcript loci, ~1.6 transcripts per locus)
# Reference mRNAs :   13201 in   11916 loci  (11405 multi-exon)
# Super-loci w/ reference transcripts:     8189
#--------------------|   Sn   |  Sp   |  fSn |  fSp  
        Base level:      84.6    58.8     -       - 
        Exon level:      62.1    64.2    63.9    66.1
      Intron level:      81.9    85.4    82.7    86.3
Intron chain level:      22.4    18.0    65.7    52.9
  Transcript level:       0.0     0.0     0.0     0.0
       Locus level:      20.4    23.8    43.7    40.2

     Matching intron chains:    2552
              Matching loci:    2426

          Missed exons:    9576/80103   ( 12.0%)
           Novel exons:    6777/77420   (  8.8%)
        Missed introns:    8046/68120   ( 11.8%)
         Novel introns:    4148/65307   (  6.4%)
           Missed loci:       0/11916   (  0.0%)
            Novel loci:    1388/10196   ( 13.6%)

#= Summary for dataset: Djun.hisat2.fr.stringtie.out.gtf :
#     Query mRNAs :   23928 in   10181 loci  (22048 multi-exon transcripts)
#            (5064 multi-transcript loci, ~2.4 transcripts per locus)
# Reference mRNAs :   13464 in   12172 loci  (11594 multi-exon)
# Super-loci w/ reference transcripts:     8208
#--------------------|   Sn   |  Sp   |  fSn |  fSp  
        Base level:      84.8    53.5     -       - 
        Exon level:      61.4    56.7    64.3    59.3
      Intron level:      82.4    79.1    83.9    80.5
Intron chain level:      23.4    12.3    81.7    43.0
  Transcript level:       0.0     0.0     0.0     0.0
       Locus level:      21.1    25.2    45.0    42.0

     Matching intron chains:    2714
              Matching loci:    2571

          Missed exons:    9443/80853   ( 11.7%)
           Novel exons:    9114/87642   ( 10.4%)
        Missed introns:    7710/68611   ( 11.2%)
         Novel introns:    5339/71470   (  7.5%)
           Missed loci:       0/12172   (  0.0%)
            Novel loci:    1362/10181   ( 13.4%)

#= Summary for dataset: Djun.STAR.stringtie.out.gtf :
#     Query mRNAs :   16342 in   10598 loci  (14061 multi-exon transcripts)
#            (3228 multi-transcript loci, ~1.5 transcripts per locus)
# Reference mRNAs :   12375 in   11113 loci  (10967 multi-exon)
# Super-loci w/ reference transcripts:     8420
#--------------------|   Sn   |  Sp   |  fSn |  fSp  
        Base level:      87.6    57.5     -       - 
        Exon level:      62.6    64.3    65.0    66.8
      Intron level:      82.2    87.2    83.0    88.0
Intron chain level:      23.5    18.3    64.2    50.1
  Transcript level:       0.0     0.0     0.0     0.0
       Locus level:      22.2    23.2    44.1    37.7

     Matching intron chains:    2579
              Matching loci:    2467

          Missed exons:    8559/78791   ( 10.9%)
           Novel exons:    6103/76698   (  8.0%)
        Missed introns:    9071/67619   ( 13.4%)
         Novel introns:    3635/63746   (  5.7%)
           Missed loci:       0/11113   (  0.0%)
            Novel loci:    1564/10598   ( 14.8%)

#= Summary for dataset: Djun.STAR.fr.stringtie.out.gtf :
#     Query mRNAs :   24198 in   10602 loci  (22048 multi-exon transcripts)
#            (5119 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs :   12563 in   11295 loci  (11129 multi-exon)
# Super-loci w/ reference transcripts:     8477
#--------------------|   Sn   |  Sp   |  fSn |  fSp  
        Base level:      88.3    52.3     -       - 
        Exon level:      62.2    56.5    66.3    60.2
      Intron level:      83.1    81.3    84.4    82.5
Intron chain level:      24.9    12.6    79.4    40.1
  Transcript level:       0.0     0.0     0.0     0.0
       Locus level:      23.3    24.8    46.1    39.8

     Matching intron chains:    2774
              Matching loci:    2633

          Missed exons:    8145/79448   ( 10.3%)
           Novel exons:    8063/87460   (  9.2%)
        Missed introns:    8601/68092   ( 12.6%)
         Novel introns:    4658/69632   (  6.7%)
           Missed loci:       0/11295   (  0.0%)
            Novel loci:    1521/10602   ( 14.3%)

What would you pick? STAR with Stringtie -fr (Djun.STAR.fr.stringtie.out.gtf) seems to do a pretty good job. F

gpertea commented 5 years ago

Yes, tough call.. that STAR.fr run does indeed have the best sensitivity at intron/chain level, though precision (Sp) is not that great.. Quite strange that some of the Sp values are so much better when no strand constraints are applied.

francicco commented 5 years ago

Maybe I can also explore hisat2.stringtie, the one with the higher Sp and see what gives me the more complete transcriptome? F

freedog8 commented 5 years ago

Ok, I fixed the STAR run, making it compatible and I also have the alignment with HISAT2. I run Stringtie for both alignments a on/off --fr, for a total of 4 annotation. They are slightly different.

Do you think would make sense to merge them?

I have two more questions

it looks like some original annotation are discarded (see the figure), If I use stringtie merge to include them in the final annotation would it work? Because I would get also rid of merged loci, like the one at the beginning of this thread.

And, as stringtie seems to extend the annotation with UTRs, the original CDS only annotation will be included or not by stringtie merge?

I know I can visually check for both things, but I'd like to know what I should expect.

Thanks a lot! F

hi @francicco Could you please tell me how you fixing the STAR run?

francicco commented 5 years ago

Try adding --outSAMattributes All Let me know F

freedog8 commented 5 years ago

Thanks a lot~~~ @francicco

freedog8 commented 5 years ago

But, it seems that the XS attributes is not included in the --outSAMattributes All option. Have you add the --outSAMstrandField intornMotif for stranded RNA-Seq? @francicco

francicco commented 5 years ago

Oh yes! Sorry! F

MACHARODRIGO commented 2 years ago

Hello! I was watching this discussion and I find very useful because I was dealying with the same problem. I have a question for a different type of analysis @gpertea In my case I analyzed 9 Bioprojects (around 84 samples) with the same workflow (trimming/hisat2/stringtie/stringtie-merge/gffcompare) and I obtained the following parameters:

Summary for dataset: Stringtie-merge_transcriptome Query mRNAs : 154170 in 46165 loci (143808 multi-exon transcripts) (24576 multi-transcript loci, ~3.3 transcripts per locus) Reference mRNAs : 33874 in 24531 loci (29109 multi-exon) Super-loci w/ reference transcripts: 22065

	Sensitivity	Precision
Base level:	100.0	39.7
Exon level:	94.8	32.0
Intron level:	100.0	35.3
Intron chain level:	100.0	20.2
Transcript level:	99.4	21.8
Locus level:	99.1	47.5

 Matching intron chains:   29109
   Matching transcripts:   33657
          Matching loci:   24317

      Missed exons:       0/138290  (  0.0%)
       Novel exons:  233744/447225  ( 52.3%)
    Missed introns:       1/109111  (  0.0%)
     Novel introns:  169594/309266  ( 54.8%)
       Missed loci:       0/24531   (  0.0%)
        Novel loci:   24100/46165   ( 52.2%)

Total union super-loci across all input datasets: 46165 154170 out of 154170 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)

I should point out that that to obtain this Merge Transcriptome, what I did was to merge the GTFs obtained by StringTie for strand and unstrand libraries, which I analyzed individually (each bioproject) with the corresponding parameter (--fr or unstranded). Given an example:

Workflow

The first question that I have is if it is correct to merge the gtf files obtained by Stringtie proccessing unstranded and stranded samples.

When I compare with the reference I have a similar problem like francicco showed before:

The transcript are in the correct position but sometimes they are too long and occupy the place of another transcript.

Thanks in advance, I hope you understand me.

Rodrigo

gpertea / stringtie

Unspecific strand #204