gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License
199 stars 32 forks source link

Difficulty matching behavior between older and newer gffcompare #18

Closed bioinfomagician closed 6 years ago

bioinfomagician commented 7 years ago

Hi,

I am taking a few stringtie merged assemblies of mouse RNAseq data and annotating them back against Gencode vM12 gtf files using gffcompare. I have noticed some differences between older versions of gffcompare and the newer versions where the arguments and default modes of gffcompare have been changed.

With version v0.9.6f, I would annotate the merged gtf file against the gencode gtf with the -G -C options which were defined as doing the following:

-G generic GFF input file(s): do not assume Cufflinks/Stringtie GTF input, (do not discard intron-redundant transfrags) -C include the "contained" transcripts in the .combined.gtf file

In the newest version cloned from github, v0.9.9d, -G is now the default and -C appears to do the opposite of what -C was defined as doing in the older versions, though I could be misinterpreting this.

-C discard the "contained" transfrags in the .combined.gtf (i.e. collapse intron-redundant transfrags across all query files)

So, I figured I would want to run the newer versions of gffcompare without -G and -C in order to mimic the behavior of the older versions of gffcompare with the -G and -C options. However, I am seeing cases where transcripts are being treated as duplicates, even when they are not the same as each other and then getting merged together. So, I tried running with -C and -C -K options, but generally, no matter what I do, I haven't been able to identify the right combination of arguments to mimic the behavior of older versions of gffcompare with -G -C. I also noticed that whenever I run with -C option, that I get a .combined.gtf file instead of a .annotated.gtf file when running without the -C option.

I've attached an image of IGV showing gencode annotatoin for a single exon gene with two transcripts of differing lengths in the gencode annotation. I also show the stringtie assembly and then the older version of gffcompare with -G -C and the newest version of gffcompare with multiple different options. You can see that in the newer version, the contained transcript is removed. This is a simple example, but I see this in multi exon genes as well.

case of merging mouse gffcompare

Maybe I am missing something obvious, so could you please give advice on how I can run the newest version of gffcompare to preserve the contained transcripts?

bioinfomagician commented 7 years ago

Also, I figured I'd post another example if that was more helpful. This is of gene Olfr1440. The transcript that goes away is highlighted in grey in the first three rows.

case of merging mouse gffcompare 2

bioinfomagician commented 7 years ago

Bumping this for a potential update.

gpertea commented 7 years ago

Indeed the changes are inconsistent, I apologize for the confusion (and the late reply -- it was a good idea to bump this up, I meant to address it but it got pushed deeper and deeper on my TODO stack).

So the -G change should not a big deal, most people actually wanted to use it so I just made it the default.. But the -C option is indeed a bit messy, I shouldn't have made such a drastic reversal of its meaning. This option only kicks in at the end of the GTF processing, when the "super-loci" are created and just before writing the "combined" GTF file -- which is by default only meant to discard the "matching" transcripts between samples, not the "contained" ones. So the -C option now is suppose to discard "contained" isoforms as well.

Now I realize, thanks to your observations, that in the latest versions of gffcompare I actually broke the implied -G default mode of operation, because what you are noticing there (thank you for the examples provided!) is the forced discarding of the "matching", "redundant" transcripts from within a single sample. Even though they look "contained", remember that "matching" in cuffcompare terms actually means just "matching intron chain", and for single-exon transcripts, it means that there is an overlap covering at least 80% of the longer transcript..

And I think I can actually recall the moment when I broke that "-G" promise in a recent code change, it was because in the case of multiple query files (samples) I realized that there were some serious complications with the "tracking" of such transcripts across samples, whenever multiple "matching"/"redundant" transcripts within a sample were allowed to coexist..

But I totally agree with your reasonable expectation that whenever a single sample file is provided, the file should be indeed just annotated, as promised, and no input transcripts should be discarded (unless, indeed, the -C option is provided explicitly requesting this, and then transcripts are "combined" and no longer just annotated..).

So I'll mark this as a bug to be corrected in the next release of gffcompare, as soon as possible -- thank you again for spotting this problem and reporting it in such a well documented manner.

bioinfomagician commented 7 years ago

Thank you for the very thorough response. That all makes a lot of sense to me. I am happy that I was able to point this out in a relatively clear way for you. I look forward to the update.

gpertea commented 6 years ago

Apologies for the very late addressing of this long-standing bug. v0.10.2 has the patch which should finally prevent the loss of any input transfrags when a single query file is given (which triggers the so called "annotation mode" when an .annotated.gtf file is generated).

bioinfomagician commented 6 years ago

Glad to hear it!