gpertea / gffcompare

classify, merge, tracking and annotation of GFF files by comparing to a reference annotation GFF
MIT License

Fuzz matching of intron chains and exons #38

Closed ChristopherWilks closed 5 years ago

ChristopherWilks commented 5 years ago

Hi Geo,

Thanks for a great tool!

I'm using this to compare various short and long read "assemblies" and it would be great if there was an option to allow a "fuzz" distance between intron and exon starts and ends.

I started looking at the code and noticed that you had something for this already:

https://github.com/gpertea/gffcompare/blob/4ed82e0bb1eb5e906f565c22b02c2708e1168115/gffcompare.cpp#L104

and also later in the file where you do the actual matching.

I'm guessing there was a good reason for commenting it out and was wondering what that might be?

I'd be interested in making my own changes to re-enable it for my work, but wanted to find out the gotchas that you might have encountered before diving in.

Thanks, Chris

gpertea commented 5 years ago

There was no good reason that I recall. I think we just got rid of it to save a bit of computation time and to simplify the output, as we realized we did not care about those values at all: we really only cared about exact matching when it comes to intron coordinates, and enough "fuzziness" is already built in for the outer boundaries of terminal exons (and thus for single-exon transcripts). Hopefully you can recover most of the code that performed the fuzzy matching, though I'm afraid I might have deleted it in some places (instead of just commenting it out) back when I decided to give up on calculating and printing those values. I think cuffcompare might still have it, although I haven't tested that old cuffcompare code with the latest gclib changes, so it might not build or work properly.
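To make the distinction between exact and "fuzzy" intron matching concrete, here is a minimal sketch of a position-by-position intron chain comparison with a tolerance. It is not taken from gffcompare or cuffcompare: the `Intron` struct, the function names and the `fuzz` parameter are all hypothetical, chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical type for illustration; not gffcompare's own data structure.
struct Intron {
    int start; // first intronic base
    int end;   // last intronic base (inclusive)
};

// True when both boundaries differ by at most `fuzz` bases.
// fuzz == 0 reduces to exact coordinate matching.
inline bool coordsMatch(const Intron& a, const Intron& b, int fuzz) {
    return std::abs(a.start - b.start) <= fuzz &&
           std::abs(a.end - b.end) <= fuzz;
}

// Compares two intron chains position by position.
// Chains with a different number of introns never match.
inline bool intronChainsMatch(const std::vector<Intron>& query,
                              const std::vector<Intron>& ref,
                              int fuzz) {
    assert(fuzz >= 0);
    if (query.size() != ref.size()) return false;
    for (std::size_t i = 0; i < query.size(); ++i) {
        if (!coordsMatch(query[i], ref[i], fuzz)) return false;
    }
    return true;
}
```

Calling `intronChainsMatch(query, ref, 0)` gives the exact-match behaviour described above, while a positive `fuzz` tolerates small boundary shifts of the kind that might show up in long-read assemblies.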

ChristopherWilks commented 5 years ago

thanks for the quick reply Geo! I'll give it a shot and see how far I get.

gpertea commented 5 years ago

Correction: sorry, my memory is getting a bit fuzzy about this as well :). I just took a look at the cuffcompare source and saw that there is no trace of "fuzzy" Sp and Sn stats in the last versions of that code either, even though I clearly remember printing them from cuffcompare back in the day, as two additional columns (fSp and fSn) in that table in the .stats file, for each "level". Digging back, I found that the last cuffcompare code actually showing fuzzy intron/exon/transcript matching stats is from 2014; this commit seems to be the last one that had it:

https://github.com/cole-trapnell-lab/cufflinks/blob/7e38d32de27d74239de5375a0dbd12283e2259ac/src/cuffcompare.cpp#L1446

It involved keeping track of ATP, AFN and AFP fields alongside the TP, FN and FP fields at various levels, in the GLocus and GSuperLocus structures defined in gtf_tracking.h (I guess "A" stood for "approximate"). I suppose most of those fields and the related code can just be forward-ported into the current version of cuffcompare with minimal changes, though there is likely a better (less messy) way of doing it from scratch.
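As a rough illustration of that bookkeeping, the sketch below keeps approximate counters next to the exact ones for a single level and derives the four percentages from them. It is not the actual GLocus or GSuperLocus code: `LevelCounts`, `LevelStats` and `computeStats` are hypothetical names, and the sketch assumes that Sn is TP/(TP+FN) and that Sp is TP/(TP+FP), with the fuzzy versions computed the same way from the "A" counters.

```cpp
// Hypothetical per-level counters (e.g. intron, exon or transcript level).
// Field names follow the comment above; the struct itself is illustrative.
struct LevelCounts {
    int TP = 0, FN = 0, FP = 0;    // exact matches
    int ATP = 0, AFN = 0, AFP = 0; // approximate ("fuzzy") matches
};

// The four percentages reported for one level.
struct LevelStats {
    double Sn;  // exact sensitivity
    double Sp;  // exact "specificity", assumed here to be TP / (TP + FP)
    double fSn; // fuzzy sensitivity
    double fSp; // fuzzy "specificity"
};

// Percentage helper that returns 0 when the denominator is empty.
inline double pct(int num, int den) {
    return den == 0 ? 0.0 : 100.0 * num / den;
}

inline LevelStats computeStats(const LevelCounts& c) {
    return {
        pct(c.TP, c.TP + c.FN),
        pct(c.TP, c.TP + c.FP),
        pct(c.ATP, c.ATP + c.AFN),
        pct(c.ATP, c.ATP + c.AFP),
    };
}
```

One design note: if every exact match also counts as an approximate match, ATP is never smaller than TP, so fSn should come out at least as high as Sn for the same level.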

ChristopherWilks commented 5 years ago

:) thanks! that'll be very helpful.