gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License
198 stars 32 forks source link

Intron chain level sensitivity is greater than 100% #54

Closed junghoon-shin closed 4 years ago

junghoon-shin commented 4 years ago

I followed step by step the protocol described in https://doi.org/10.1038/nprot.2016.095 on my 1,359 RNA-Seq bam files. The resulting stats file from gffcompare is as below:

#     Query mRNAs :  613028 in   89757 loci  (566077 multi-exon transcripts)
#            (29763 multi-transcript loci, ~6.8 transcripts per locus)
# Reference mRNAs :  194187 in   54800 loci  (169307 multi-exon)
# Super-loci w/ reference transcripts:    46624
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    27.7    |
        Exon level:    84.2     |    46.6    |
      Intron level:    99.8     |    51.1    |
Intron chain level:   100.7     |    30.1    |
  Transcript level:    56.4     |    17.9    |
       Locus level:    84.1     |    46.5    |

     Matching intron chains:  170540
       Matching transcripts:  109545
              Matching loci:   46097

          Missed exons:       0/559962  (  0.0%)
           Novel exons:  277224/1116827 ( 24.8%)
        Missed introns:     645/343915  (  0.2%)
         Novel introns:  137860/671963  ( 20.5%)
           Missed loci:       0/54800   (  0.0%)
            Novel loci:   43133/89757   ( 48.1%)

 Total union super-loci across all input datasets: 89757 

As shown, the intron chain level sensitivity is 100.7%. Is this valid?

gpertea commented 4 years ago

Can you tell me which version of gffcompare was that, and what was the command line used? (These should be shown at the top of that stats file)

I remember encountering that problem in the past, in the rare situations when there were "duplicate" entries in the input query file, i.e. transcripts with the exact same intron chain -- those may be counted multiple times. That's because they were not expected to be in the input -- most transcript assemblers/predictors should not generate such "redundant" isoforms. Was that transcript file generated by StringTie (and if so, which version?).

junghoon-shin commented 4 years ago

I used gffcompare v0.11.6. Below I pasted the whole content of the stats file.

# gffcompare v0.11.6 | Command line was:
#gffcompare -r <GENCODE_GTF> -o <PREFIX> merged.gtf
#

#= Summary for dataset: <INPUT_GTF> 
#     Query mRNAs :  613028 in   89757 loci  (566077 multi-exon transcripts)
#            (29763 multi-transcript loci, ~6.8 transcripts per locus)
# Reference mRNAs :  194187 in   54800 loci  (169307 multi-exon)
# Super-loci w/ reference transcripts:    46624
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    27.7    |
        Exon level:    84.2     |    46.6    |
      Intron level:    99.8     |    51.1    |
Intron chain level:   100.7     |    30.1    |
  Transcript level:    56.4     |    17.9    |
       Locus level:    84.1     |    46.5    |

     Matching intron chains:  170540
       Matching transcripts:  109545
              Matching loci:   46097

          Missed exons:       0/559962  (  0.0%)
           Novel exons:  277224/1116827 ( 24.8%)
        Missed introns:     645/343915  (  0.2%)
         Novel introns:  137860/671963  ( 20.5%)
           Missed loci:       0/54800   (  0.0%)
            Novel loci:   43133/89757   ( 48.1%)

 Total union super-loci across all input datasets: 89757 
613028 out of 613028 consensus transcripts written in <PREFIX>.annotated.gtf (0 discarded as redundant)

As written in the last line of the stats file content, I think there was no redundant transcript in in the input file. I used StringTie 2.1.3b. First I ran StringTie for each of the 1,359 input bam files, and then I merged all output GTF files using stringtie --merge. I ran gffcompare using this merged GTF file as input. Below I copy-pasted my command used.

while read bam
do
       stringtie "$bam" -p 1 -G <GENCODE_GTF> -o <OUTPUT_FILENAME> &
       <OUTPUT_FILENAME> >> stringtie_output_gtf_list.txt
done < input_bam_list.txt
wait

stringtie --merge -p 20 -G <GENCODE_GTF> -o merged.gtf stringtie_output_gtf_list.txt

gffcompare -r <GENCODE_GTF> -o <PREFIX> merged.gtf
gpertea commented 4 years ago

After further investigation, it seems to me this actually happened due to a bug in the intron-level (and transcript level) accuracy calculation, introduced by some code changes in v0.11.6. I just fixed those problems (and made transcript-level accuracy to not be so stringent by default, like it was before v0.11.6), so could you please try the new release (v0.11.7) on the same data set and let me know if the problem is still there (should not be). This latest release can be downloaded from https://github.com/gpertea/gffcompare/releases/tag/v0.11.7 or from http://ccb.jhu.edu/software/stringtie/gffcompare.shtml#gffcompare_dl

Thank you for reporting this issue!

junghoon-shin commented 4 years ago

Thank you for the update. I tried v0.11.7 and it seems that the intron chain level accuracy issue has been solved. Below is the new stats file.

# gffcompare v0.11.7 | Command line was:
#gffcompare -r <GENCODE_GTF> -o <PREFIX> stringtie_merged.gtf
#

#= Summary for dataset: stringtie_merged.gtf 
#     Query mRNAs :  613028 in   89757 loci  (566077 multi-exon transcripts)
#            (29763 multi-transcript loci, ~6.8 transcripts per locus)
# Reference mRNAs :  194187 in   54800 loci  (169307 multi-exon)
# Super-loci w/ reference transcripts:    46624
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    27.7    |
        Exon level:    84.2     |    46.6    |
      Intron level:    99.8     |    51.1    |
Intron chain level:    99.9     |    29.9    |
  Transcript level:    98.2     |    31.1    |
       Locus level:    95.7     |    50.7    |

     Matching intron chains:  169199
       Matching transcripts:  190672
              Matching loci:   52417

          Missed exons:       0/559962  (  0.0%)
           Novel exons:  277224/1116827 ( 24.8%)
        Missed introns:     645/343915  (  0.2%)
         Novel introns:  137860/671963  ( 20.5%)
           Missed loci:       0/54800   (  0.0%)
            Novel loci:   43133/89757   ( 48.1%)

 Total union super-loci across all input datasets: 89757 
613028 out of 613028 consensus transcripts written in <PREFIX>.annotated.gtf (0 discarded as redundant)

I got three new questions.

  1. The transcript level precision is now larger than the intron chain level precision. Is this intended?
  2. I guess that the definition of transcript level matching described in the documentation should be changed. What would be the updated definition?
  3. Also, I would appreciate it if you could recommend which one to use (between intron level and transcript level) as a general measure of accuracy estimate of a transcript assembler.
gpertea commented 4 years ago
  1. Yes, it's because 'transcript level' includes all those matching single-exon transcripts
  2. I updated the documentation with an explanation for how that works. The matching criteria for SETs (single exon transcripts) are hinted at in the "Query vs Reference transcripts" section below there.
  3. Transcript Level would be most comprehensive as it takes into account single-exon transcripts as well
junghoon-shin commented 4 years ago

Thank you for clearing them up and developing this nice tool!

junghoon-shin commented 4 years ago

One more question: Does the class code '=' represent transcript-level matching or intron chain-level matching? i.e. How is the class code assigned for SETs?

gpertea commented 4 years ago

"class codes" always refer to the way query transcripts positionally or structurally relate to reference transcripts. The strongest relationship is "matching" (class code =). For METs, as you guessed, the class code = (and thus "matching") really means that their intron chains are identical (by default; this changes if --strict-match option is used, but that's a special/experimental case, let's talk the default usage).

For SETs assessing "matching" (class code =) is a complex problem (and poorly defined) because the transcript boundaries will never simply "match", transcription can start/end at various distances, there is no intron chain there to anchor the "matching" assessment so I had to come up with some arbitrary criteria for a "fuzzy" matching threshold. As I mentioned those criteria are actually hinted at in the "Query vs Reference transcripts" section of the manual (http://ccb.jhu.edu/software/stringtie/gffcompare.shtml#gffcompare_qr), even though in that paragraph I described how duplicate/redundant query SETs are found when loading the "query" transcripts file, that explanation also applies to how query SETs are "matched" (i.e. assigned class code =) to the overlapping reference SETs, it's when

the overlapping region is >= 80% of the length of the larger transcript OR ( >=70% of the length of larger transcript AND >= 80% of the length of the shorter transcript) The latter clause seems a bit redundant but that particular check is also used somewhere else in gffcompare, namely for the "reverse containment" cases when we wanted to relax the way we assess a "match" with a reference when the reference is shorter than the query and possibly "almost contained" within the query

As you see in that paragraph there are also hints about how other somewhat related class codes for SETs are assigned (like c and k, i.e. containment and reverse containment).

junghoon-shin commented 4 years ago

Thank you very much for the detailed explanation. Based on your explanation and the documentation, my understanding is as follows (assuming that gffcompare v0.11.7 was used with default options).

  1. For multi-exon transcripts, the criteria for 'transcript level' matching and 'intron chain level' matching are identical.
  2. For single-exon transcripts, the criteria for 'transcript level' matching between a query transcript and a reference transcript are (1) overlapping region is >= 80% of the length of the larger transcript OR (2) ( >=70% of the length of larger transcript AND >= 80% of the length of the shorter transcript).
  3. Single-exon transcripts are completely ignored when calculating the 'intron chain level' sensitivity and precision.
  4. Class code '=' is assigned to a query transcript IF AND ONLY IF that query transcript is matched to a reference transcript by 'transcript level' matching criteria.
  5. Sensitivity and precision estimates displayed in the .stats file are calculated after removing all 'redundant' transcripts from both the reference transcript set and the query transcript set.
  6. The .annotated.gtf and .tracking files display all query transcripts (regardless of whether they are redundant or not).
  7. Because of 5 and 6, the 'transcript level' precision estimate displayed in the .stats file can differ from the proportion of query transcripts with class code '=' among all query transcripts shown in the .tracking file.

Please let me know if any of the above is wrong.

gpertea commented 4 years ago

I think 6 is incorrect, the redundant transcripts are simply discarded and no longer used in further processing and output. I am expecting now 7 might point to some inconsistency you found between the precision value and the = codes ? The truth is those class codes are assigned/re-evaluated separately (in a different code that considers all other class codes), it's not the same as the quick "matching"-only code used for precision/accuracy stats and I suppose it is possible that the latest changes might have created a discrepancy there, especially when it comes to SETs (those are always a problem as "matching" may be inconsistently defined for them when 'c' and 'k' class codes are also considered).

junghoon-shin commented 4 years ago

Yes, I asked 6 and 7 because I found a discrepancy between the precision value shown in the .stats file and the actual proportion of query transcripts with class code '=' in the .tracking file. I can now understand why. Thank you very much for the explanation.