.tracking file does not always show the 'closest' reference transcript for SETs

junghoon-shin commented 4 years ago

My input GTF file (input.gtf) contains this single-exon transcript (among many others).

1       StringTie       transcript      30366   30563   1000    +       .       gene_id "MSTRG.8"; transcript_id "ENST00000607096.1"; gene_name "MIR1302-11"; ref_gene_id "ENSG00000243485.2"; 
1       StringTie       exon    30366   30563   1000    +       .       gene_id "MSTRG.8"; transcript_id "ENST00000607096.1"; exon_number "1"; gene_name "MIR1302-11"; ref_gene_id "ENSG00000243485.2";

My reference GTF file (reference.gtf) contains these two transcripts (among many others).

1       HAVANA  transcript      30267   31109   .       +       .       gene_id "ENSG00000243485.2"; transcript_id "ENST00000469289.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-002"; level 2; tag "not_best_in_genome_evidence"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
1       HAVANA  exon    30267   30667   .       +       .       gene_id "ENSG00000243485.2"; transcript_id "ENST00000469289.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-002"; exon_number 1; exon_id "ENSE00001841699.1"; level 2; tag "not_best_in_genome_evidence"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
1       HAVANA  exon    30976   31109   .       +       .       gene_id "ENSG00000243485.2"; transcript_id "ENST00000469289.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "lincRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-002"; exon_number 2; exon_id "ENSE00001890064.1"; level 2; tag "not_best_in_genome_evidence"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002841.2";
1       ENSEMBL transcript      30366   30503   .       +       .       gene_id "ENSG00000243485.2"; transcript_id "ENST00000607096.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-201"; level 3; tag "basic"; havana_gene "OTTHUMG00000000959.2";
1       ENSEMBL exon    30366   30503   .       +       .       gene_id "ENSG00000243485.2"; transcript_id "ENST00000607096.1"; gene_type "lincRNA"; gene_status "NOVEL"; gene_name "MIR1302-11"; transcript_type "miRNA"; transcript_status "KNOWN"; transcript_name "MIR1302-11-201"; exon_number 1; exon_id "ENSE00003695741.1"; level 3; tag "basic"; havana_gene "OTTHUMG00000000959.2";

I ran gffcompare using this command.

gffcompare -r reference.gtf -o gffcompare input.gtf

The resulting gffcompare.tracking file contains this row.

TCONS_00000011  XLOC_000003     ENSG00000243485.2|ENST00000469289.1     c       q1:MSTRG.8|ENST00000607096.1|1|0.000000|0.000000|0.000000|198

But the input transcript ENST00000607096.1 is obviously more 'close' to the reference transcript ENST00000607096.1 than to the reference transcript ENST00000469289.1. Hence I think the third column of the .tracking file should have shown ENSG00000243485.2|ENST00000607096.1. Is this a bug or am I missing something? I used gffcompare v0.11.7.

gpertea commented 4 years ago

That is an interesting observation and indeed it exposes a shortcoming of the current classification system in gffcompare: it uses a simple table of 'class code' priorities, and 'c' code is always considered to be better than 'k' code (which are the two codes here for those data, you can use the trmap utility which is included with gffcompare to see all the possible overlap codes found there, not just the "top" one like gffcompare shows).

Besides that class code priority table, the overlap length is considered next -- and in that case again you can see that ENST00000469289.1 (the lincRNA) has a higher number of bases in common.

But I agree with you that your example seems to suggest that the same number of exons should be taken into account as well, or at least in case of SETs, overlaps with other SETs should automatically have a higher priority than overlaps with METs.

However this seems to be a rather unusual reference annotation in that region - the lincRNA there seems a bit dubious. Transcriptionally I am not sure if that is biologically possible (is it?), having a transcript totally covering/overriding another transcript like that, and it's clear that gffcompare was not ready to deal with such situations -- essentially what you have there is a 'c' relationship between two reference transcripts overlapping transcriptionally like that, with the larger multi-exon transcript overriding the shorter one which is located within one of its exons. Can that be real?

It is also a bit of a problem that GffCompare also expects the query data to be more "fragmented" and incomplete than the reference annotation, so in this particular situation, considering that the query transcript covers more of the lincRNA and goes beyond the boundaries of the miRNA, the assumption would be that the query transcript is rather a fragment of the lincRNA than an extension of the miRNA.. That may well be wrong but you have to realize it's a tough call there in that situation.

junghoon-shin commented 4 years ago

Thank you very much for the clear and detailed explanation. Regarding whether it is biologically possible that a miRNA locus is entirely contained in another transcript locus, I think it is. This paper might be helpful (see Figure 3).

gpertea commented 4 years ago

Interesting, thank you for pointing me to that paper. That situation shown there (Fig. 3 C) suggests that most often such (already rare) exon embedded miRNA extends into the adjacent intron, which makes it closer to a distinct transcript rather than just being fully embedded/contained into another transcript's exon.

gpertea commented 3 years ago

I think I finally have this working properly in v0.12.5 (last commits to master branch here) as I added now special treatment for SETs (giving priority to SET overlaps as we discussed here) and also adding the number of matched junctions as a an additional "structural similarity" criterion for METs (instead of just using the overlap length). Thank you for bringing attention to these issues.

gpertea / gffcompare

.tracking file does not always show the 'closest' reference transcript for SETs #58