gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License

Number of loci and mRNAs in reference keeps changing #84

Closed gbdias closed 9 months ago

gbdias commented 9 months ago

Hi,

gffread: 0.12.7 gffcompare: 0.12.6

When I run multiple comparisons of GFFs against the same reference file (even within a single gffcompare command), the number of reference mRNAs and loci reported on the "Reference mRNAs" line of the stats file changes from one dataset to the next. I even tried clustering the reference GFF beforehand, but I can't pin these numbers down. Any tips?

# gffcompare v0.12.6 | Command line was:
#gffcompare -R -r dm6_pcgs_clustered.gff3 -o bla /projects/guilherme/dmel/annotation/braker_r/braker_chrfix.gff3 /projects/guilherme/dmel/annotation/braker_p/braker_chrfix.gff3 /projects/guilherme/dmel/annotation/braker_rp/braker_chrfix.gff3
#

#= Summary for dataset: /projects/guilherme/dmel/annotation/braker_r/braker_chrfix.gff3
#     Query mRNAs :   16767 in   14659 loci  (13932 multi-exon transcripts)
#            (1643 multi-transcript loci, ~1.1 transcripts per locus)
# Reference mRNAs :   27624 in   12746 loci  (25244 multi-exon)
# Super-loci w/ reference transcripts:    12164
#-----------------| Sensitivity | Precision  |
        Base level:    63.6     |    97.1    |
        Exon level:    47.3     |    57.9    |
      Intron level:    69.3     |    88.0    |
Intron chain level:    27.5     |    49.8    |
  Transcript level:    27.8     |    45.8    |
       Locus level:    57.2     |    50.1    |

     Matching intron chains:    6938
       Matching transcripts:    7676
              Matching loci:    7286

          Missed exons:   11537/74577   ( 15.5%)
           Novel exons:    3628/60877   (  6.0%)
        Missed introns:    8491/57591   ( 14.7%)
         Novel introns:    3178/45338   (  7.0%)
           Missed loci:       0/12746   (  0.0%)
            Novel loci:    1124/14659   (  7.7%)

#= Summary for dataset: /projects/guilherme/dmel/annotation/braker_p/braker_chrfix.gff3
#     Query mRNAs :   16365 in   14992 loci  (12972 multi-exon transcripts)
#            (1031 multi-transcript loci, ~1.1 transcripts per locus)
# Reference mRNAs :   27031 in   12485 loci  (24633 multi-exon)
# Super-loci w/ reference transcripts:    12421
#-----------------| Sensitivity | Precision  |
        Base level:    65.3     |    96.8    |
        Exon level:    49.6     |    62.5    |
      Intron level:    69.4     |    92.3    |
Intron chain level:    30.3     |    57.6    |
  Transcript level:    30.9     |    51.1    |
       Locus level:    65.0     |    54.7    |

     Matching intron chains:    7470
       Matching transcripts:    8364
              Matching loci:    8113

          Missed exons:   11746/72741   ( 16.1%)
           Novel exons:    3415/57550   (  5.9%)
        Missed introns:    9730/56106   ( 17.3%)
         Novel introns:    1668/42205   (  4.0%)
           Missed loci:       0/12485   (  0.0%)
            Novel loci:    1267/14992   (  8.5%)

#= Summary for dataset: /projects/guilherme/dmel/annotation/braker_rp/braker_chrfix.gff3
#     Query mRNAs :   15414 in   13364 loci  (12186 multi-exon transcripts)
#            (1431 multi-transcript loci, ~1.2 transcripts per locus)
# Reference mRNAs :    4865 in    2375 loci  (4290 multi-exon)
# Super-loci w/ reference transcripts:     2368
#-----------------| Sensitivity | Precision  |
        Base level:    65.1     |    18.2    |
        Exon level:    48.2     |    11.6    |
      Intron level:    70.5     |    17.0    |
Intron chain level:    35.3     |    12.4    |
  Transcript level:    34.8     |    11.0    |
       Locus level:    66.9     |    12.0    |

     Matching intron chains:    1516
       Matching transcripts:    1691
              Matching loci:    1588

          Missed exons:    2202/12732   ( 17.3%)
           Novel exons:   43344/52499   ( 82.6%)
        Missed introns:    1909/9398    ( 20.3%)
         Novel introns:   32395/39070   ( 82.9%)
           Missed loci:       0/2375    (  0.0%)
            Novel loci:   10892/13364   ( 81.5%)

 Total union super-loci across all input datasets: 24607
  (5349 multi-transcript, ~1.4 transcripts per locus)
35458 out of 35458 consensus transcripts written in bla.combined.gtf (0 discarded as redundant)

gbdias commented 9 months ago

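Sorry, I found out the reason. For anyone else searching: it's the -R option. With -R, gffcompare considers only the reference transcripts that overlap the input transfrags, so the reference totals depend on each query file. Removing it makes the number of reference transcripts consistent across all comparisons.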
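
For anyone who wants to check this on their own run, here is a minimal Python sketch that pulls the reference totals out of the stats text and reports whether they agree across datasets. It assumes the header lines look exactly like the ones in the report above. The helper names `reference_counts` and `is_consistent` are made up for illustration and are not part of gffcompare, and the dataset paths in the sample are shortened placeholders.

```python
import re

# Matches lines such as:
# "# Reference mRNAs :   27624 in   12746 loci  (25244 multi-exon)"
REF_RE = re.compile(r"^#\s*Reference mRNAs\s*:\s*(\d+)\s+in\s+(\d+)\s+loci")
# Matches the per-dataset header: "#= Summary for dataset: <path>"
DATASET_RE = re.compile(r"^#=\s*Summary for dataset:\s*(\S.*)$")


def reference_counts(stats_text):
    """Return {dataset: (reference_mrnas, reference_loci)} from .stats text."""
    counts = {}
    dataset = None
    for line in stats_text.splitlines():
        m = DATASET_RE.match(line)
        if m:
            dataset = m.group(1).strip()
            continue
        m = REF_RE.match(line)
        if m and dataset is not None:
            counts[dataset] = (int(m.group(1)), int(m.group(2)))
    return counts


def is_consistent(counts):
    """True when every dataset reports the same reference totals."""
    return len(set(counts.values())) <= 1


# Header lines copied from the report above; the paths are shortened
# placeholders (assumption for illustration only).
sample = """\
#= Summary for dataset: braker_r.gff3
# Reference mRNAs :   27624 in   12746 loci  (25244 multi-exon)
#= Summary for dataset: braker_p.gff3
# Reference mRNAs :   27031 in   12485 loci  (24633 multi-exon)
#= Summary for dataset: braker_rp.gff3
# Reference mRNAs :    4865 in    2375 loci  (4290 multi-exon)
"""

counts = reference_counts(sample)
for name, (mrnas, loci) in counts.items():
    print(f"{name}: {mrnas} reference mRNAs in {loci} loci")
print("consistent:", is_consistent(counts))
```

To use it on a real run, read the stats file written by gffcompare into a string and pass it to `reference_counts` in place of `sample`.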