twsaari / FeatureSequence

JBrowse plugin to view the sequence of features
GNU General Public License v3.0

Cases where some subparts don't highlight correctly #14

Open scottcain opened 5 years ago

scottcain commented 5 years ago

Hi @tsaari88,

I have a case at WormBase where FeatureSequence doesn't seem to get the subparts to highlight correctly. Could you take a look at our bug https://github.com/WormBase/website/issues/6819 and let me know if you have any thoughts?

Thanks, Scott

twsaari commented 5 years ago

Hi Scott.

This is due to overlaps between subfeatures in the underlying annotation. Overlapping subfeatures are fundamentally problematic for this program, and that's unlikely to change. I couldn't find a way to consistently and correctly highlight, hide, or otherwise mark overlapping subfeatures without a lot of extra logic to handle the many ways nested boundaries can be arranged.

Knowing this, I did two things for when overlapping subfeatures are detected:

  1. In the common case of a CDS/UTR overlapping with an exon, I included some logic to check that the boundaries make sense and then basically exclude the exon data (it is redundant in that case). Once this is completed, the overlaps should no longer exist.

  2. If overlaps still persisted, I (perhaps foolishly?) still allowed FeatureSequence to continue with an older (slower and not recommended) implementation of the viewer that ignores the overlaps and instead warns the user that problems are likely. When you see the warning dialog which shows up in the referenced videos, this is what the end of the full text reads:

Overlapping subfeatures will cause problems in viewing their boundaries. This may also cause the Feature Sequence Viewer to respond slowly.

Is it reasonable for you to modify the annotation data? If so, perhaps you could provide a relevant portion of your annotation file for the track in question and we could figure it out? I think that the automatic CDS/UTR/exon parsing is failing, and that's why you're seeing this.
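To make the two steps above concrete, here is a minimal Python sketch of the idea. It is not the plugin's actual code (FeatureSequence is a JBrowse plugin and its internals are not shown in this thread); the names `Sub`, `find_overlaps`, `merge_touching` and `drop_redundant_exons` are hypothetical, and the rule "drop exons only when CDS/UTR pieces tile them exactly" is an assumption about what "the boundaries make sense" means.

```python
from typing import List, NamedTuple, Tuple


class Sub(NamedTuple):
    """Hypothetical subfeature record; coordinates are 1-based inclusive, as in GFF3."""
    type: str
    start: int
    end: int


# Subfeature types that can make exon records redundant (assumption).
PARTS = {"CDS", "five_prime_UTR", "three_prime_UTR"}


def find_overlaps(subs: List[Sub]) -> List[Tuple[Sub, Sub]]:
    """Return neighbouring pairs (after sorting by start) that share a base.

    Checking neighbours is enough to tell whether any overlap exists at all.
    """
    ordered = sorted(subs, key=lambda s: (s.start, s.end))
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b.start <= a.end]


def merge_touching(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge intervals that overlap or are directly adjacent."""
    merged: List[List[int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def drop_redundant_exons(subs: List[Sub]) -> List[Sub]:
    """Step 1: drop exons if CDS/UTR pieces reproduce every exon exactly."""
    exons = sorted((s.start, s.end) for s in subs if s.type == "exon")
    parts = [(s.start, s.end) for s in subs if s.type in PARTS]
    if exons and merge_touching(parts) == exons:
        return [s for s in subs if s.type != "exon"]
    return list(subs)  # boundaries do not line up: leave the data untouched


# Fully annotated transcript: exons are redundant and can be dropped.
full = [
    Sub("exon", 1, 100), Sub("five_prime_UTR", 1, 20), Sub("CDS", 21, 100),
    Sub("exon", 201, 300), Sub("CDS", 201, 250), Sub("three_prime_UTR", 251, 300),
]
cleaned = drop_redundant_exons(full)

# Exon + CDS only (no explicit UTR records): the check fails, overlaps remain.
no_utr = [Sub("exon", 1, 100), Sub("CDS", 21, 100)]
still_overlapping = find_overlaps(drop_redundant_exons(no_utr))
```

In this sketch the second case is the one that would fall through to step 2 (the warning dialog), because the CDS alone does not cover the whole exon.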

scottcain commented 5 years ago

Hi @tsaari88,

Thanks for looking at this. Here is a sample of GFF that is causing problems:

https://gist.github.com/scottcain/8945cfa9ca820b5d287dd0c428785264

and the corresponding JBrowse track for the B0304.1c.1 transcript:

https://staging.wormbase.org/tools/genome/jbrowse-simple/full.html?data=data%2Fc_elegans_PRJNA13758&loc=II%3A4517520..4523069&tracks=Curated%20Genes%20(protein%20coding)&highlight=

A particular oddity is the warning message in the dialog between five_prime_UTR_4 and exon_5, since there aren't 4 5' UTR lines in this GFF.

scottcain commented 3 years ago

Hi @twsaari

WormBase folks have again been pointing out problems with the FeatureSequence plugin. Is this something you have time to look at? If so, please use these transcripts:

https://wormbase.org//tools/genome/jbrowse-simple/full.html?data=data%2Fc_elegans_PRJNA13758&loc=I%3A310026..315638&tracks=Curated%20Genes%20(protein%20coding)%2CCurated_Genes&highlight=

Unfortunately, I can't link to the bug report because we've taken it private, but this is the most recent comment:

It seems there are still some errors with the "FeatureSequence Viewer" tool in JBrowse. I came across the latest example while looking into a help desk ticket about transcript C53D5.6.2. Here's a summary of the behavior of the tool with regard to this transcript:

Track: Curated Genes

Action: right-click on imb-3 gene, click "View Sequence", select transcript C53D5.6.2, and view sequence

Issues:

  - "CDSs" button: OK
  - "UTRs" button: Highlights the portion of the 3'UTR contained on the next-to-last exon, but doesn't include 3'UTR sequence on the final exon; misses the 5'UTR entirely
  - "Exons" button: Only highlights the first and last exon, none of the middle exons
  - "Five_prime_UTRs" button: OK
  - "Introns" button: OK
  - "Others" button: not applicable?
  - "Three_prime_UTRs" button: OK
  - "Upstream" button: OK
  - "Downstream" button: OK

Same issues appear on the "Curated genes (protein coding)" track.

Checking other genes in JBrowse, it is clear that the "UTRs" button and the "Exons" button consistently have problems. For one gene/transcript (marc-4/C53D5.2.1), the "Exon" button doesn't appear at all (yes, it is spliced).

So, it seems that the "Exon" button, if it appears, consistently highlights only a small portion of the exon sequence for the transcript. The "UTR" button usually misses most or all of the 5'UTR and sometimes the 3'UTR, but always misses some UTR sequence, particularly when UTRs contain introns.

vaneet-lotay commented 3 years ago

Hello,

I believe I have been having the same issue with the FeatureSequence viewer: consistently, only the first exon is highlighted. All of the CDS sequences are highlighted properly from what I can see. Yes, this is a case of a GFF with overlapping CDS/exon sequences. I would add, though, that this gene model structure isn't really going away. I've noticed that when NCBI releases new gene annotations for different genomes, they always tend to use overlapping CDS/exon subfeatures, most likely so that the exon regions not covered by a CDS indicate the UTRs at the beginning and end of the transcripts.
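To illustrate what I mean by UTRs being implied by the non-overlapping regions, here is a small Python sketch that derives them from exon and CDS coordinates. The function name `implied_utrs` and the toy coordinates are made up for illustration; it assumes 1-based inclusive coordinates as in GFF3 and that every CDS segment lies inside an exon.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def implied_utrs(exons: List[Interval], cds: List[Interval],
                 strand: str = "+") -> List[Tuple[str, int, int]]:
    """Return the exon portions that lie outside the CDS span, labelled as UTRs."""
    if not cds:
        return []  # non-coding transcript: nothing to call a UTR
    cds_start = min(start for start, _ in cds)
    cds_end = max(end for _, end in cds)
    # On the minus strand the genomic "left" side is the 3' end.
    if strand == "+":
        left, right = "five_prime_UTR", "three_prime_UTR"
    else:
        left, right = "three_prime_UTR", "five_prime_UTR"
    utrs = []
    for start, end in sorted(exons):
        if start < cds_start:
            utrs.append((left, start, min(end, cds_start - 1)))
        if end > cds_end:
            utrs.append((right, max(start, cds_end + 1), end))
    return utrs


# Toy transcript: three exons, coding region starts inside exon 2
# and ends inside exon 3.
exons = [(1, 100), (201, 300), (401, 500)]
cds = [(251, 300), (401, 450)]
plus = implied_utrs(exons, cds, "+")
minus = implied_utrs(exons, cds, "-")
```

Note that the 5' UTR here spans an intron (all of exon 1 plus the start of exon 2), which is the kind of layout where the "UTRs" button was reported to miss sequence.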

I just wanted to check whether this has been updated or fixed yet. I understand if it hasn't, as it sounds like there's a lot to deal with in these scenarios in terms of your logic. If it hasn't been resolved, is it true, going by your earlier comment, that when there are overlapping subfeatures, all exons are treated as redundant and excluded from the highlighting/lowercase features?

What do you suggest for users in this scenario? Should they focus only on CDS segments when both are present?

Thanks,

Vaneet

mictadlo commented 2 years ago

Hi, FeatureSequence highlights only 2 exons, but there are 11 in this annotation:

NbLab350C17     scallop gene    92773802        92784469        .       +       .       ID=NbL17g15920
NbLab350C17     scallop mRNA    92773802        92784469        .       +       .       ID=NbL17g15920.1;Parent=NbL17g15920;RPKM=3.7278;Note=uncharacterized LOC109243108 transcript variant X4 XP_019265549.1;evalue=0.00;cov=303.4095
NbLab350C17     scallop exon    92773802        92773977        .       +       .       ID=NbL17g15920.1.exon.0;Parent=NbL17g15920.1;exon=1;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92774154        92774303        .       +       .       ID=NbL17g15920.1.exon.1;Parent=NbL17g15920.1;exon=2;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92775176        92775522        .       +       .       ID=NbL17g15920.1.exon.2;Parent=NbL17g15920.1;exon=3;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92775427        92775522        .       +       .       ID=NbL17g15920.1.CDS.0;Parent=NbL17g15920.1;exon=3;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92776156        92776314        .       +       .       ID=NbL17g15920.1.exon.3;Parent=NbL17g15920.1;exon=4;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92776156        92776314        .       +       .       ID=NbL17g15920.1.CDS.1;Parent=NbL17g15920.1;exon=4;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92777391        92777471        .       +       .       ID=NbL17g15920.1.exon.4;Parent=NbL17g15920.1;exon=5;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92777391        92777471        .       +       .       ID=NbL17g15920.1.CDS.2;Parent=NbL17g15920.1;exon=5;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92781380        92781472        .       +       .       ID=NbL17g15920.1.exon.5;Parent=NbL17g15920.1;exon=6;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92781380        92781472        .       +       .       ID=NbL17g15920.1.CDS.3;Parent=NbL17g15920.1;exon=6;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92781705        92781797        .       +       .       ID=NbL17g15920.1.exon.6;Parent=NbL17g15920.1;exon=7;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92781705        92781797        .       +       .       ID=NbL17g15920.1.CDS.4;Parent=NbL17g15920.1;exon=7;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92781877        92782020        .       +       .       ID=NbL17g15920.1.exon.7;Parent=NbL17g15920.1;exon=8;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92781877        92782020        .       +       .       ID=NbL17g15920.1.CDS.5;Parent=NbL17g15920.1;exon=8;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92783207        92783314        .       +       .       ID=NbL17g15920.1.exon.8;Parent=NbL17g15920.1;exon=9;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92783207        92783314        .       +       .       ID=NbL17g15920.1.CDS.6;Parent=NbL17g15920.1;exon=9;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92783491        92783542        .       +       .       ID=NbL17g15920.1.exon.9;Parent=NbL17g15920.1;exon=10;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92783491        92783542        .       +       .       ID=NbL17g15920.1.CDS.7;Parent=NbL17g15920.1;exon=10;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop CDS     92783630        92783661        .       +       .       ID=NbL17g15920.1.CDS.8;Parent=NbL17g15920.1;exon=11;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
NbLab350C17     scallop exon    92783630        92784469        .       +       .       ID=NbL17g15920.1.exon.10;Parent=NbL17g15920.1;exon=11;gene_id=gene.134870.0;transcript_id=gene.134870.0.7
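
A quick way to check whether a transcript like this one falls into the overlapping exon/CDS case discussed earlier in this thread is to group the subfeatures by `Parent` and list every exon/CDS pair that shares coordinates. This is only a rough Python sketch: `parse_gff` and `exon_cds_overlaps` are hypothetical helper names, the excerpt holds four of the records above with the attribute column shortened to `ID` and `Parent`, and columns are split on any whitespace so it works on text pasted without tabs.

```python
from collections import defaultdict

# Four of the records above; attributes trimmed to ID and Parent for brevity.
GFF_EXCERPT = """\
NbLab350C17 scallop exon 92775176 92775522 . + . ID=NbL17g15920.1.exon.2;Parent=NbL17g15920.1
NbLab350C17 scallop CDS 92775427 92775522 . + . ID=NbL17g15920.1.CDS.0;Parent=NbL17g15920.1
NbLab350C17 scallop exon 92776156 92776314 . + . ID=NbL17g15920.1.exon.3;Parent=NbL17g15920.1
NbLab350C17 scallop CDS 92776156 92776314 . + . ID=NbL17g15920.1.CDS.1;Parent=NbL17g15920.1
"""


def parse_gff(text):
    """Group (type, start, end, ID) tuples by their Parent attribute."""
    by_parent = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.split(None, 8)  # 9 GFF columns; whitespace-tolerant
        if len(cols) < 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        by_parent[attrs.get("Parent")].append(
            (cols[2], int(cols[3]), int(cols[4]), attrs.get("ID"))
        )
    return by_parent


def exon_cds_overlaps(subs):
    """Return (exon ID, CDS ID) for every exon/CDS pair sharing at least one base."""
    exons = [s for s in subs if s[0] == "exon"]
    cds = [s for s in subs if s[0] == "CDS"]
    return [
        (e[3], c[3])
        for e in exons
        for c in cds
        if e[1] <= c[2] and c[1] <= e[2]
    ]


subs = parse_gff(GFF_EXCERPT)["NbL17g15920.1"]
pairs = exon_cds_overlaps(subs)
for exon_id, cds_id in pairs:
    print(f"{exon_id} overlaps {cds_id}")
```

Any pair reported here is an overlap of the kind that triggers the warning dialog; in the full annotation above, every CDS sits inside an exon and there are no explicit UTR records.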

mictadlo commented 2 years ago

Has anyone found an alternative tool?