gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License
198 stars 32 forks source link

Lost reference mRNA during multiple assembly comparison #55

Closed CIWa closed 4 years ago

CIWa commented 4 years ago

Hi,

Thank you very much for providing this handy tool! I used gffcompare to compare a merged assembly by scallop (v 0.10.5) + TACO (v 0.7.3), and one assembled and merged by Stringtie (v 2.1.4). I then realised that gffcompare seems to loose several reference mRNAs between the analysed assemblies. Furthermore, the numbers of lost mRNAs are inconsistent across variation in the order that the assemblies are called. I was just wondering if this is intended, or a bug?

Best wishes, Isabel

Scallop-assembly called first:

# gffcompare v0.11.6 | Command line was:
#gffcompare -r annotation_file.gff -C -T -V -o sca ../scallop/TACO/assembly.gtf ../stringtie2/merged_transcripts.gtf
#

#= Summary for dataset: ../scallop/TACO/assembly.gtf 
#     Query mRNAs :  104870 in   46525 loci  (76206 multi-exon transcripts)
#            (20882 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs :   30244 in   30244 loci  (25779 multi-exon)
# Super-loci w/ reference transcripts:    14939
#-----------------| Sensitivity | Precision  |
        Base level:    59.4     |    16.4    |
        Exon level:    43.5     |    24.4    |
      Intron level:    53.3     |    33.3    |
Intron chain level:     4.2     |     1.4    |
  Transcript level:     0.4     |     0.1    |
       Locus level:     0.4     |     0.3    |

     Matching intron chains:    1080
       Matching transcripts:     132
              Matching loci:     132

          Missed exons:   51510/158118  ( 32.6%)
           Novel exons:  155874/281983  ( 55.3%)
        Missed introns:   29644/127874  ( 23.2%)
         Novel introns:   92112/204881  ( 45.0%)
           Missed loci:   11049/30244   ( 36.5%)
            Novel loci:   29431/46525   ( 63.3%)

#= Summary for dataset: ../stringtie2/merged_transcripts.gtf 
#     Query mRNAs :   42862 in   22951 loci  (37578 multi-exon transcripts)
#            (8697 multi-transcript loci, ~1.9 transcripts per locus)
# Reference mRNAs :   29993 in   29993 loci  (25544 multi-exon)
# Super-loci w/ reference transcripts:    12550
#-----------------| Sensitivity | Precision  |
        Base level:    53.2     |    24.8    |
        Exon level:    40.9     |    35.1    |
      Intron level:    49.4     |    39.1    |
Intron chain level:     3.4     |     2.3    |
  Transcript level:     2.9     |     2.0    |
       Locus level:     2.9     |     3.8    |

     Matching intron chains:     872
       Matching transcripts:     862
              Matching loci:     862

          Missed exons:   62962/156883  ( 40.1%)
           Novel exons:   84607/184385  ( 45.9%)
        Missed introns:   40458/126890  ( 31.9%)
         Novel introns:   58014/160463  ( 36.2%)
           Missed loci:   13624/29993   ( 45.4%)
            Novel loci:    9634/22951   ( 42.0%)

 Total union super-loci across all input datasets: 46604
  (18509 multi-transcript, ~2.8 transcripts per locus)
127505 out of 129094 consensus transcripts written in sca.combined.gtf (1589 discarded as redundant)

Stringtie-assembly called first:

# gffcompare v0.11.6 | Command line was:
#gffcompare -r annotation_file.gff -C -T -V -o str ../stringtie2/merged_transcripts.gtf ../scallop/TACO/assembly.gtf
#

#= Summary for dataset: ../stringtie2/merged_transcripts.gtf 
#     Query mRNAs :   42862 in   22951 loci  (37578 multi-exon transcripts)
#            (8697 multi-transcript loci, ~1.9 transcripts per locus)
# Reference mRNAs :   30244 in   30244 loci  (25779 multi-exon)
# Super-loci w/ reference transcripts:    12550
#-----------------| Sensitivity | Precision  |
        Base level:    52.8     |    24.8    |
        Exon level:    40.6     |    35.1    |
      Intron level:    49.0     |    39.1    |
Intron chain level:     3.4     |     2.3    |
  Transcript level:     2.9     |     2.0    |
       Locus level:     2.9     |     3.8    |

     Matching intron chains:     872
       Matching transcripts:     862
              Matching loci:     862

          Missed exons:   64197/158118  ( 40.6%)
           Novel exons:   84607/184385  ( 45.9%)
        Missed introns:   41442/127874  ( 32.4%)
         Novel introns:   58014/160463  ( 36.2%)
           Missed loci:   13875/30244   ( 45.9%)
            Novel loci:    9634/22951   ( 42.0%)

#= Summary for dataset: ../scallop/TACO/assembly.gtf 
#     Query mRNAs :  104870 in   46525 loci  (76206 multi-exon transcripts)
#            (20882 multi-transcript loci, ~2.3 transcripts per locus)
# Reference mRNAs :   30206 in   30206 loci  (25753 multi-exon)
# Super-loci w/ reference transcripts:    14939
#-----------------| Sensitivity | Precision  |
        Base level:    59.5     |    16.4    |
        Exon level:    43.5     |    24.4    |
      Intron level:    53.4     |    33.3    |
Intron chain level:     4.2     |     1.4    |
  Transcript level:     0.4     |     0.1    |
       Locus level:     0.4     |     0.3    |

     Matching intron chains:    1080
       Matching transcripts:     132
              Matching loci:     132

          Missed exons:   51419/158027  ( 32.5%)
           Novel exons:  155874/281983  ( 55.3%)
        Missed introns:   29591/127821  ( 23.2%)
         Novel introns:   92112/204881  ( 45.0%)
           Missed loci:   11011/30206   ( 36.5%)
            Novel loci:   29431/46525   ( 63.3%)

 Total union super-loci across all input datasets: 46604
  (18509 multi-transcript, ~2.8 transcripts per locus)
127504 out of 129068 consensus transcripts written in str.combined.gtf (1564 discarded as redundant)
gpertea commented 4 years ago

That looks like a serious bug. I tried to reproduce that here but could not. That should not happen, the reference set of transcripts is loaded and clustered only once and it should not change across processing of multiple query files.

It's that bad and unexpected that I am beginning to suspect some sort of data corruption going on there. Running the program through sanitizers does not seem to reveal anything low level on my system with my data files. However perhaps there is a bug somehow triggered by specific input like the one you had there. Is it possible to share those 3 files with me? (annotation_file.gff + the 2 query files)

Btw v0.11.6 introduced an unnecessarily stringent calculation of transcript (and locus) level accuracy (along with some possibly buggy counts there). I just released v0.11.7 correcting that (and reversing to the same calculation used before v0.11.6). Unfortunately that has nothing to do with the buggy behavior reported here (tested both v0.11.6 and v0.11.7 and could not reproduce this strange behavior with either).

CIWa commented 4 years ago

Yes, no problem, please find the link to the related data in your gmail-inbox. Thanks also for the info about the update, I will make sure to use this version in the future then. I tested it already on the data set reported here, but, as you expected, it did not change anything about the bug.

Thank you for the quick help! Best wishes, Isabel

gpertea commented 4 years ago

Thank you for the data. It seems the issue is caused by the somewhat unusual situation (that I did not take into account) where each of the query files have a different set of reference sequences (scaffolds/contigs: 4607 for the scallop assembies, 4286 for the stringtie assemblies) and neither of them match the set of reference sequence in the annotation GFF, which is larger (5816).

The reference transcripts are not counted for the reference sequences (scaffolds) having no entries at all in a query transcript data set. I'm working on a fix for this.

gpertea commented 4 years ago

Apologies for the long delay in responding here -- I had fixed this a while ago, the fix is included in the v0.12.1 release

CIWa commented 4 years ago

All right, thanks a lot!

Best wishes, Isabel