gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License
204 stars 32 forks source link

gffcompare shown wrong result when using strict-match #47

Open grassking100 opened 4 years ago

grassking100 commented 4 years ago

Hello, Recently, I was running the GFFcompare and found some confusing results. The query file has one single-exon transcript which is slightly different from the reference file at the boundary, and the query file has also one multiple-exon transcript which is slightly different from the reference file at the boundary. When I using GFFCompare with "--strict-match -e 0 -d 0", I was expected that the transcript level of two data should be 0% due to the error at the boundary, but the results have shown 100%. What are the possible reasons this might happens?

Thank you grassking100

gffcmp.stats

# gffcompare v0.11.5 | Command line was:
#gffcompare --strict-match --chr-stats --debug -T -e 0 -d 0 --no-merge -r subanswer.gff3 subpredict.gff3
#
#> Genomic sequence: region_0385 
#     Query mRNAs :       1 in       1 loci  (0 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       1 in       1 loci  (0 multi-exon)
# Super-loci w/ reference transcripts:        1
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.8    |
        Exon level:     0.0     |     0.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       0
       Matching transcripts:       1
              Matching loci:       1

          Missed exons:       0/1   (  0.0%)
           Novel exons:       0/1   (  0.0%)
           Missed loci:       0/1   (  0.0%)
            Novel loci:       0/1   (  0.0%)
#> Genomic sequence: region_0333 
#     Query mRNAs :       1 in       1 loci  (1 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       1 in       1 loci  (1 multi-exon)
# Super-loci w/ reference transcripts:        1
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.9    |
        Exon level:    66.7     |    66.7    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       1
       Matching transcripts:       1
              Matching loci:       1

          Missed exons:       0/3   (  0.0%)
           Novel exons:       0/3   (  0.0%)
        Missed introns:       0/2   (  0.0%)
         Novel introns:       0/2   (  0.0%)
           Missed loci:       0/1   (  0.0%)
            Novel loci:       0/1   (  0.0%)

#= Summary for dataset: subpredict.gff3 
#     Query mRNAs :       2 in       2 loci  (1 multi-exon transcripts)
#            (0 multi-transcript loci, ~1.0 transcripts per locus)
# Reference mRNAs :       2 in       2 loci  (1 multi-exon)
# Super-loci w/ reference transcripts:        2
#-----------------| Sensitivity | Precision  |
        Base level:   100.0     |    99.8    |
        Exon level:    50.0     |    50.0    |
      Intron level:   100.0     |   100.0    |
Intron chain level:   100.0     |   100.0    |
  Transcript level:   100.0     |   100.0    |
       Locus level:   100.0     |   100.0    |

     Matching intron chains:       1
       Matching transcripts:       2
              Matching loci:       2

          Missed exons:       0/4   (  0.0%)
           Novel exons:       0/4   (  0.0%)
        Missed introns:       0/2   (  0.0%)
         Novel introns:       0/2   (  0.0%)
           Missed loci:       0/2   (  0.0%)
            Novel loci:       0/2   (  0.0%)

 Total union super-loci across all input datasets: 2 
2 out of 2 consensus transcripts written in gffcmp.annotated.gtf (0 discarded as redundant)

subanswer.gff3

##gff-version 3
region_0385 .   gene    500 1432    .   +   .   ID=region_0769_gene_1;Parent=region_0769
region_0385 .   mRNA    500 1432    .   +   .   ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385 .   exon    500 1432    .   +   .   ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333 .   gene    501 2878    4   -   .   ID=AT2G26870_gene
region_0333 .   mRNA    501 2878    4   -   .   ID=AT2G26870;Parent=AT2G26870_gene
region_0333 .   exon    501 1315    4   -   .   Parent=AT2G26870
region_0333 .   exon    1409    1781    4   -   .   Parent=AT2G26870
region_0333 .   exon    2383    2878    4   -   .   Parent=AT2G26870

subpredict.gff3

##gff-version 3
region_0385 .   gene    500 1434    .   +   .   ID=region_0769_gene_1;Parent=region_0769
region_0385 .   mRNA    500 1434    .   +   .   ID=region_0769_gene_1_mRNA;Parent=region_0769_gene_1
region_0385 .   exon    500 1434    .   +   .   ID=region_0769_gene_1_mRNA_exon_1;Parent=region_0769_gene_1_mRNA
###
region_0333 .   gene    499 2878    .   -   .   ID=region_0666_gene_1;Parent=region_0666
region_0333 .   mRNA    499 2878    .   -   .   ID=region_0666_gene_1_mRNA;Parent=region_0666_gene_1
region_0333 .   exon    499 1315    .   -   .   ID=region_0666_gene_1_mRNA_exon_3;Parent=region_0666_gene_1_mRNA
region_0333 .   exon    1409    1781    .   -   .   ID=region_0666_gene_1_mRNA_exon_2;Parent=region_0666_gene_1_mRNA
region_0333 .   exon    2383    2878    .   -   .   ID=region_0666_gene_1_mRNA_exon_1;Parent=region_0666_gene_1_mRNA
gpertea commented 4 years ago

I agree, it does look inconsistent and it's likely buggy. There are a few confusing issues here, looking back at the code I wrote for those options, I noticed that:

So I'll have to fix the --strict-match option and I guess I should also make it affect the transcript level stats, as you expected.. As for -e , I think I should make it affect transcript level accuracy as well - as documented in the manual.

Thank you for reporting these bugs/inconsistencies while providing very nice and clear example files.