twsaari / FeatureSequence

JBrowse plugin to view the sequence of features
GNU General Public License v3.0

Overlapping subfeatures dialog - is this still correct? #5

Closed keiranmraine closed 5 years ago

keiranmraine commented 8 years ago

Hi,

Really happy that this now handles transcripts but can you confirm that the dialog for overlapping sub-features is working correctly?

I have a genome build of Caenorhabditis_elegans - WBcel235 using the Ensembl GFF3 found here as the source (filtered down to protein_coding only):

ftp://ftp.ensembl.org/pub/release-85/gff3/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.85.gff3.gz

Using the gene hmr-1 as an example I get the following if I select the first transcript (WB02B9.1a.1) in the drop-down:

Warning: overlap between subfeatures exon_21 and five_prime_UTR_22
Warning: overlap between subfeatures CDS_11 and exon_20
Warning: overlap between subfeatures exon_19 and CDS_10
Warning: overlap between subfeatures CDS_9 and exon_18
Warning: overlap between subfeatures CDS_8 and exon_17
Warning: overlap between subfeatures CDS_7 and exon_16
Warning: overlap between subfeatures CDS_6 and exon_15
Warning: overlap between subfeatures exon_14 and CDS_5
Warning: overlap between subfeatures exon_13 and CDS_4
Warning: overlap between subfeatures three_prime_UTR_1 and exon_2
...

Are you able to verify this is the correct behaviour?

Thanks, Keiran

twsaari commented 8 years ago

Hi Keiran,

This is a scenario which I didn't foresee. In the GFF you provided, the UTRs are both explicitly and implicitly defined. By implicitly defined I mean that the UTR is determined by taking the 'exon' and 'CDS' features together and finding the parts of each exon that no CDS covers. And by explicitly defined of course I mean that there's also an actual line in the GFF for each three_prime_UTR or five_prime_UTR. This is a problem because it's two ways of encoding the same information. You might want to check whether the two sets of information are identical.
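To make the "implicit" definition concrete, here is a minimal Python sketch of deriving UTRs as the parts of exons not covered by CDS. It is not the plugin's actual code; the function names `subtract_intervals` and `label_utrs` and the example coordinates are hypothetical, and it assumes GFF3-style 1-based inclusive coordinates for a single transcript.

```python
def subtract_intervals(exons, cds):
    """Return the parts of each exon that no CDS segment covers.

    Intervals are (start, end) tuples, 1-based and inclusive as in GFF3.
    """
    cds = sorted(cds)
    leftover = []
    for start, end in sorted(exons):
        cursor = start  # first exon position not yet accounted for
        for c_start, c_end in cds:
            if c_end < cursor or c_start > end:
                continue  # this CDS segment does not touch the remaining exon part
            if c_start > cursor:
                leftover.append((cursor, c_start - 1))
            cursor = max(cursor, c_end + 1)
        if cursor <= end:
            leftover.append((cursor, end))
    return leftover


def label_utrs(segments, cds, strand):
    """Label non-coding exon segments as 5' or 3' UTR (assumes they lie outside the CDS span)."""
    cds_start = min(s for s, _ in cds)
    labelled = []
    for start, end in segments:
        before_cds = end < cds_start  # lower genomic coordinates than the coding region
        # On '+', lower coordinates are upstream (5'); on '-' it is the reverse.
        kind = "five_prime_UTR" if (strand == "+") == before_cds else "three_prime_UTR"
        labelled.append((kind, start, end))
    return labelled


# Hypothetical transcript on the '+' strand, for illustration only.
exons = [(100, 200), (300, 400), (500, 600)]
cds = [(150, 200), (300, 400), (500, 550)]
print(label_utrs(subtract_intervals(exons, cds), cds, "+"))
```

Comparing this derived list against the explicit three_prime_UTR and five_prime_UTR lines for the same transcript would be one way to check whether the two encodings agree.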

Removing either one of these will solve the problem, but I don't know if that's something you want to do. If I remove the three_prime_UTR's and five_prime_UTR's from the GFF, then the plugin still recognizes the UTRs from their implicit definitions. If I conversely remove the exons from the GFF, then the plugin goes by the explicit definitions (as removing the exon features also removes the implicitly defined UTR information contained within them).
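As a rough illustration of the first option (dropping the explicit UTR lines before loading the data), the following Python sketch filters GFF3 lines by the feature type in column 3. The function name `drop_feature_types` and the sample records are hypothetical; this is one possible pre-processing step, not something the plugin or JBrowse does for you.

```python
def drop_feature_types(lines, unwanted):
    """Yield GFF3 lines, skipping features whose type (column 3) is in `unwanted`.

    Comment lines, directives, and blank lines are passed through unchanged.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            yield line
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3 and columns[2] in unwanted:
            continue  # drop the explicitly defined feature
        yield line


# Made-up records for illustration; real input would be read from the GFF3 file.
sample = [
    "##gff-version 3\n",
    "I\tsrc\texon\t100\t200\t.\t+\t.\tParent=transcript:t1\n",
    "I\tsrc\tfive_prime_UTR\t100\t149\t.\t+\t.\tParent=transcript:t1\n",
    "I\tsrc\tCDS\t150\t200\t.\t+\t0\tParent=transcript:t1\n",
]
kept = list(drop_feature_types(sample, {"five_prime_UTR", "three_prime_UTR"}))
print("".join(kept), end="")
```

Passing `{"exon"}` instead would correspond to the second option, where only the explicit UTR definitions remain.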

I'm somewhat hesitant to implement a generalized logic for this type of thing, as it's dependent on making more assumptions, e.g. which information to ignore. And you know what they say about assumptions...

twsaari commented 8 years ago

There was also a small bug with sorting, which is now fixed on both branches. If you do a git pull, you'll find that the error message will better reflect reality now.

keiranmraine commented 8 years ago

Hi, I see the problem. Is it possible to provide a configuration option to ignore implicit UTRs? It would make this very easy to test, and it could possibly remain as an interface option so that data generation can stay 'unaltered' from the original.

It seems that the issue may be in bin/flatfile-to-json.pl; perhaps this should be filtering the data appropriately. It displays fine in the browser.

keiranmraine commented 8 years ago

FYI, the error is far less verbose with this sort fix. An initial dig through a few genes, setting a highlight on 5'UTR and lowercase UTR, shows it all lining up.

colindaven commented 8 years ago

Thanks for the new version, works great with GFF3 files produced by gmap on non-model organisms.

These files do typically have (mostly) overlapping CDS and exon features.