gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License
198 stars 32 forks source link

Question on gffcompare output #48

Closed ljw90607 closed 4 years ago

ljw90607 commented 4 years ago

Hello, I am currently working on the isoform detection analysis using gffcompare. I was wondering if there are some cases on the reference (Gencode for instance) that have quite a similar isoform structure variation where it might be difficult for the input transcript to be discriminated, how would the input transcript be decided for its annotation? If the input transcript gets annotated to one of the closest matching reference transcripts, what would be the deciding factor? (more matching bases?) I could not get a clear answer that I wanted from the utility page. If you help with this issue, I would really appreciate it!

Jungwoo.

gpertea commented 4 years ago

Good point. Besides the priority assigned to each "class code" , gffcompare also keeps track of the "overlap length" (or "matching bases" as you suggested). So between multiple references with the same 'class code' , the one with the largest overlap will be chosen. Now for the "matching" references (class code =), recently I've added the concept of "overlap score", which is the overlap length minus the length of the "overhangs" (shown in red in the attached picture for two example transcripts (in blue) matching the same reference transcript (shown in black) ). match_overhangs

The situation exemplified in this picture is actually a bit tricky if you consider the problem from the point of view of the reference - which one of the query transcripts are the "closest", or "best match" for that reference transcript? (the .refmap file produced by gffcompare is supposed to answer this question). It is hard to answer because the "overlap scores" are very close in this case, because the first transcript has a smaller overlap there for the first exon, despite having shorter overhangs.

Getting back to your question, to summarize: if a query transcript overlaps multiple references, for simple non-match overlaps the better overlap length is the deciding factor, except for class code '=' where the better overlap score is deciding, where overlap_score = overlap_len - overhangs_len. If the _overlapscore is the same, then it's a toss-up - the "matching" reference to use for annotation is chosen randomly. Note that in the annotation data usually there are not multiple "matching" references in the same location, and gffcompare will actually discard most of those as "duplicates" (though it's quite strict about their structural matching when it does that).

ljw90607 commented 4 years ago

Dear @gpertea

Thank you very much for your wonderful comment. It is really helpful.

The steps you have explained does make sense for the "matching" transcripts. But what if there are some isoform candidate transcripts that have internal exon variation which can be called as "junction match", and have similar reference transcripts for their match? Would there be any deciding factor for this situation? (or have overlap score?)

In my understanding, the "matching" class transcripts are only called when all the internal exon boundaries do match with the reference. But I am having a quite difficult time figuring out the logic of which "junction match" transcripts are annotated from which reference transcripts.

Thank you so much again for your help!

Jungwoo

gpertea commented 4 years ago

To clarify : if a query transcript matches a reference transcript (using class code =) and also overlaps other reference transcript(s) (using class code j, which means at least one intron-exon junction matches), obviously the output files produced by gffcompare will show only the = matching reference transcript (due to the class code priorities, = always overrides j ).

However if a query transcript does not "match" any reference transcript (no class code =) but just overlaps multiple reference transcripts with the same class code j, unfortunately gffcompare is currently using only the amount of matching bases (overlap length) to choose as the "closest" reference, so it will pick the reference transcript that has "more bases in common with" the query transcript to show in the output files. Now I do see how that is a rather poor/weak criterion -- because it can be argued that the number of matching junctions should always take precedence over the number of common bases, right? And I can agree to that, perhaps I should implement a better scoring scheme for these cases. Some of the more interesting isoform variations are already covered by other class codes (c or k for containment, m or n for retained introns etc.) which all have a higher priority over the bland j class code.. But when it comes to just multiple j code assignments, indeed gffcompare could be found guilty of using only the number of common bases to pick the "closest" reference/annotation transcript.

If you are interested in a more thorough analysis of all the junction overlaps between isoforms and reference transcripts you could use the trmap utility which is distributed with gffcompare and that will output all the overlaps (with their class code) found between every query transcripts vs. a set of reference transcripts. The output of that utility includes all the exon coordinates (for both query and reference transcripts) that could be easily used for further analysis of those overlaps, so one could write a script to parse that output and pick a better "representative" reference transcript for a query that only has j code overlaps with the reference transcripts.

ljw90607 commented 4 years ago

Thank you very much again for wonderful comment. I will definitely try the suggestion.

ljw90607 commented 4 years ago

Dear @gpertea,

I would like to ask a quick question about the previous topic that we have discussed. I found out that the sample transcript that shows the missing exon (around 50bp) compare to the reference shown below, was labeled with the coverage at 100% percentage using "trmap" -S option.

trnascript_pic_1

I even tried annotating the reference to the sample and still observed the same result (in this case, transcript A at 100% to the sample transcript B) I am not quite sure whether the algorithm ignores some of the short bases containing exons?

If you have any idea of what this could be, please let me know! Thank you for your wonderful help!

Jungwoo

gpertea commented 4 years ago

Sorry to admit I forgot why the -S option even exists, but I guess the original intent was to provide a value showing how much of each (query) transcript length is covered ("confirmed") by "annotated" transcription bases (i.e. reference transcripts' exons). If that was the original intent then it seems that the 100% value there is correct, because all the bases of Transcript B seem to be covered by reference transcript(s) bases, right? Not the other way around, of course. The fact that there is no actual structural match between transcript B and transcript A does not seem to matter for this -S output. If you really want to be more precise about structural matching/overlap between sample and reference then perhaps you should not use the -S option and use the default output of trmap instead (and compute your own coverage for the = class overlaps)

But you also say that when you swapped query and reference you still got 100% coverage for transcript A by transcript B (only by transcript B? or were there other transcripts in the sample overlapping that short exon ? ). Looking at the proportions of exons in the figure you have there I think it is possible for this to be a silly rounding issue, as that exon seems very small relative to the entire length of the transcript (assuming there are many more large exons not shown in the truncated figure). What is the total exonic length of transcript A ?

ljw90607 commented 4 years ago

I see. I thought that the value intended the coverage of bases referring to that specific transcript.

Here I am providing the length for both transcript A & B. Transcript A : 1549bp Transcript B : 1535bp

Although the overall length different is only 14bp, the missing exon in Transcript B is 24bp because the exon next to it (towards the 5' end) is 10bp longer in Transcript B than A.

gpertea commented 4 years ago

A late update to this: I realized that trmap with -S option was actually not computing exon coverage, but just plain full-range interval coverage (start..end of the transcripts). That was not very useful, I suppose. The latest release actually changed this so trmap -S now reports specifically exon coverage.

clsteam commented 4 years ago

Dear @gpertea , Thank you for your wonderful comment. I have some questions for the follow-up to this question: What is the priority principle of class code? The result file obtained by trmap contains all overlaps. As you said above, the result of gffcompare is = if overlaps contains =, but if it contains multiple other types of ’class code' (such as y, i, m, j ..., without the =), how does it work in the end? Is overlap_score used to judge?

Give a simple example: If I now need to screen candidate transcripts of lncrna based on the results of trmap, should I put transcripts that include uxijo and = into the candidate set?

Thank you, Yao

gpertea commented 4 years ago

Currently the class codes have a priority of each class code is somewhat reflected in the table found on the documentation page (https://ccb.jhu.edu/software/stringtie/gffcompare.shtml#transfrag-class-codes), though some of those code are "tied" in terms of priority level and then that tie is broken by the overlap length as you assumed. Some codes have no direct overlap (zero overlap length) - e.g. i and y are by definition not overlapping any exons. Those codes have an absolute code-level priority set lower than the actually overlapping codes.