gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License

Interpreting sensitivity and precision #7

Closed jstrohm closed 7 years ago

jstrohm commented 7 years ago

Hello,

I mapped my RNA-seq reads with HISAT2 and assembled them with StringTie, using the Salmo salar whole genome as a reference in both stages. I understand that "Sensitivity is the proportion of coding nucleotides that have been correctly predicted as coding, and Specificity is the proportion of noncoding nucleotides that have been correctly predicted as noncoding." (From Burset and Guigo 1996)

Is Precision synonymous with Specificity in this case?

What could lead to the discrepancy between Sensitivity and Precision seen in my data? Thanks!

gffcompare v0.9.8 | Command line was:
./gffcompare -r /media/genetics/Data1/Jeff_workstation_1/subread-1.5.0-p1/GCF_000233375.1_ICSASG_v2_genomic.gff -o /media/genetics/Data1/Jeff_workstation_1/gffcompare-0.9.8.Linux_x86_64/compare/8-B7877_S4_1a_mRNA-e-disabled /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf

= Summary for dataset: /media/genetics/Data1/Jeff_workstation_1/stringtie-1.3.1c.Linux_x86_64/outputs/assembled-e-disabled/8-B7877_S4_1a_mRNA-e-disabled.gtf
     Query mRNAs :   20484 in   19076 loci  (13136 multi-exon transcripts)
            (998 multi-transcript loci, ~1.1 transcripts per locus)
 Reference mRNAs :  136039 in   81267 loci  (111148 multi-exon)
 Super-loci w/ reference transcripts:    12245
-----------------| Sensitivity | Precision  |
        Base level:    12.2     |    86.9    |
        Exon level:     8.9     |    73.9    |
      Intron level:     9.5     |    96.7    |
Intron chain level:     5.7     |    48.6    |
  Transcript level:     5.6     |    37.3    |
       Locus level:     8.3     |    35.4    |

 Matching intron chains:    6378
   Matching transcripts:    7647
          Matching loci:    6744

      Missed exons:  498437/562650  ( 88.6%)
       Novel exons:    5960/66229   (  9.0%)
    Missed introns:  421469/471248  ( 89.4%)
     Novel introns:     935/46255   (  2.0%)
       Missed loci:   68711/81267   ( 84.5%)
        Novel loci:    4714/19076   ( 24.7%)

Total union super-loci across all input datasets: 17075

jstrohm commented 7 years ago

I'm aware that I have low coverage, since RNA pools from several samples share a single MiSeq run (this is an exploratory proof of concept). I'm guessing that sensitivity is related to coverage of the whole transcriptome, while precision only pertains to the transcripts that were assembled?

gpertea commented 7 years ago

So yes, Precision replaces the usage of "Specificity" from that old paper (and from old gene-finding terminology), a term which nowadays has a different meaning. In gffcompare, Precision is computed as TP/(TP+FP), so it only pertains to the assembled transcripts (because TP+FP = all of the assembled transcripts' data). Sensitivity is computed as TP/(TP+FN), so the false negatives depend heavily on the amount of missed reference transcript data, which in turn depends on the number of reference transcripts given. In RNA-Seq experiments, where not all known transcripts are expressed anyway, the Sensitivity value is expected to be very low if one uses the whole known genome annotation as the reference data set. To get a better estimate of Sensitivity, one should restrict the reference transcript set so that it is as close as possible to the set of reference transcripts expressed in the RNA-Seq sample.
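As a rough check of the two formulas, the sketch below recomputes three rows of the report from the counts posted in this thread. The function name and the choice of denominators are assumptions made for illustration: TP+FN is taken to be the reference total and TP+FP the query total, with the intron chain level counted over multi-exon transcripts only. This is not gffcompare's actual code.

```python
def sensitivity_precision(matched, n_reference, n_query):
    """Return (sensitivity, precision) as fractions.

    matched     -- TP: features present in both reference and query
    n_reference -- TP + FN: all reference features
    n_query     -- TP + FP: all query (assembled) features
    """
    return matched / n_reference, matched / n_query


# Counts copied from the report in this thread:
# (matched, reference total, query total).
levels = {
    # Assumption: intron chains are counted over multi-exon transcripts only.
    "Intron chain": (6378, 111148, 13136),
    "Transcript": (7647, 136039, 20484),
    "Locus": (6744, 81267, 19076),
}

for name, counts in levels.items():
    sn, pr = sensitivity_precision(*counts)
    print(f"{name:>12} level: {100 * sn:5.1f} | {100 * pr:5.1f}")
```

Under those assumptions the printed percentages should agree with the corresponding rows of the report. The same 7647 matching transcripts give a small ratio against 136039 reference mRNAs and a much larger one against 20484 assembled mRNAs, which is the discrepancy asked about.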

One would generally use gffcompare with the -R option to get a better estimate of Sensitivity there. This option instructs gffcompare to discard all reference transcripts that were not "hit" at all by any assembled transfrag, which is as close as we can get here to "guessing" the expressed set of transcripts. I know one might argue that -R should be the default, and perhaps it should, though originally, for basic simulation experiments, we did not need it at all, and we were mostly interested in the accuracy of transcript reconstruction (i.e. intron chain and transcript precision).
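To illustrate why dropping un-hit reference transcripts raises Sensitivity, here is a toy sketch of the idea. All transcript IDs and coordinates are made up, a "hit" is simplified to any coordinate overlap on the same sequence, and strand and exon structure are ignored. The helper names are hypothetical, and this is not how gffcompare implements -R.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def overlaps(a, b):
    """True if two closed intervals on the same sequence share a base."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def restrict_reference(reference, query):
    """Keep only reference transcripts overlapped by at least one query."""
    return {
        ref_id: iv
        for ref_id, iv in reference.items()
        if any(overlaps(iv, q) for q in query)
    }


# Hypothetical toy annotation: four reference transcripts, two assembled ones.
reference = {
    "ref1": Interval("chr1", 100, 500),
    "ref2": Interval("chr1", 1000, 1500),
    "ref3": Interval("chr2", 100, 900),
    "ref4": Interval("chr2", 2000, 2600),
}
query = [Interval("chr1", 100, 500), Interval("chr1", 1200, 1400)]
matched = {"ref1"}  # assumed: only ref1 was reconstructed exactly

full_sn = len(matched) / len(reference)
kept = restrict_reference(reference, query)
restricted_sn = len(matched & kept.keys()) / len(kept)

print(f"all references: Sn = {full_sn:.2f}")
print(f"hit references: Sn = {restricted_sn:.2f} (kept {sorted(kept)})")
```

In this toy case the number of matches stays the same while the denominator shrinks from four reference transcripts to the two that were hit, so Sensitivity rises without any change to the assembly.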

jstrohm commented 7 years ago

Brilliant, thank you! Yes, I actually used -R yesterday since I thought it more accurately reflected my experiment type, but then wrongly assumed it was causing my 100% sensitivity and precision issue, and I forgot about it. Must be Friday...

Thanks again!