Protein sequence feature viewer (for domains mutants and modifications)

ValWood commented 1 year ago

Replaces https://github.com/pombase/website/issues/667

Features

Essential

Actively maintained
Good speed, especially for loading, and viewing multiple features quickly with mouse-over

Configurable

Ability to configure tracks of different types
Ability to configure pop-ups associated with tracks

Browser features essential

Zoomable feature viewer (into amino acid residue)
Coupling to protein feature viewer (only need to pick one full length representative structure)
Coupling mutants to phenotypes (probably via pop-ups or tables)

Desirable

Ability for the user to toggle active tracks on and off (Calipho has this)
Download SVG image? (Calipho has this)
Ability to link to the equivalent residue in human. I'm not sure how that could happen; it would need to be precomputed from alignments (possibly Panther families could be used for this). This isn't super urgent.
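
As a rough sketch of what "precomputed from alignments" could mean: given the two rows of a pairwise alignment (same length, `-` for gaps), one pass over the columns records which target residue sits opposite each query residue. The function name and the toy sequences are made up for illustration, and nothing here reflects how Panther alignments are actually distributed.

```python
def map_residues(aligned_query: str, aligned_target: str) -> dict:
    """Map 1-based query positions to 1-based target positions.

    Both arguments are rows of the same pairwise alignment, with '-' for gaps.
    Query residues aligned to a gap are left out of the result.
    """
    if len(aligned_query) != len(aligned_target):
        raise ValueError("alignment rows must have the same length")
    mapping = {}
    q_pos = t_pos = 0
    for q_char, t_char in zip(aligned_query, aligned_target):
        if q_char != "-":
            q_pos += 1
        if t_char != "-":
            t_pos += 1
        if q_char != "-" and t_char != "-":
            mapping[q_pos] = t_pos
    return mapping


# Toy alignment (hypothetical sequences, not real proteins)
toy = map_residues("MK-TAYL", "MKQTA-L")
```

A residue that falls opposite a gap simply has no entry, so the viewer could show "no equivalent residue" rather than guessing.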

Tracks we would like to display initially

Single sequence region mutants (i.e. A123G, OR A234G,A235G OR 123-223 delta) with associated phenotypes

Modifications (ALL)
Pfam domains
TMM regions
Low complexity regions
Coiled coils

Ideally, these regions would display in structure viewer and give a mouse-over pop-up with “details”

Later datatypes

Secondary structure regions
Hydropathy?
Others?

Examples

RCSB https://www.rcsb.org/3d-sequence/4OFB?asymId=A
Calipho https://www.nextprot.org/entry/NX_P38398/sequence
ProtVista https://www.uniprot.org/uniprotkb/P38398/feature-viewer

kimrutherford commented 1 year ago

Progress so far: https://desktop.kmr.nz/gene/SPBC19C2.09

ValWood commented 1 year ago

Looking good!

kimrutherford commented 1 year ago

Modifications!

https://desktop.kmr.nz/gene/SPBC19C2.09

kimrutherford commented 1 year ago

Should we put the allele/variants and modifications at the top since they're important?:

Also are "Variants" and "Modifications" OK as labels?

kimrutherford commented 1 year ago

RCSB use capitals for the track labels. Should we do the same to make them stand out?:

kimrutherford commented 1 year ago

We should cite: Joan Segura, Yana Rose, John Westbrook, Stephen K Burley, Jose M Duarte. RCSB Protein Data Bank 1D tools and services, Bioinformatics, 2020; https://doi.org/10.1093/bioinformatics/btaa1012

ValWood commented 1 year ago

I don't think it is working on your desktop right now. Hopefully you went to bed. I'm leaning towards the capitalization. @manulera what do you think?

Good progress!

ValWood commented 1 year ago

The example for multiple modification types for a single residue was hht1. I will try to find one with more...

ValWood commented 1 year ago

I also agree to switch the order.

kimrutherford commented 1 year ago

I don't think it is working on your desktop right now.

While I'm changing things it often breaks or goes offline. It should be working in the morning. I'm off to bed now so I'll be turning the desktop off.

I've added Pfam families, but since there are no mouse-over details, it's not very useful. I plan to add mouse-overs tomorrow.

I also agree to switch the order.

Done.

The example for multiple modification types for a single residue was hht1

OK, thanks. That will be good for testing tomorrow:

ValWood commented 1 year ago

Very nice!

kimrutherford commented 1 year ago

I've added mouse overs on my desktop version: https://desktop.kmr.nz/gene/SPAC1834.04

The text is very minimal at the moment. What should be displayed for variants and modifications? I think it will look best if we don't have more than two lines of text, if possible.

kimrutherford commented 1 year ago

I've now added the protein feature view to the dev site so you don't need to rely on my machine being online: http://dev.pombase.kmr.nz/gene/SPBC19C2.09

ValWood commented 1 year ago

It looks very nice. We can discuss the text. This is soooo useful.

Some random thoughts which should probably be in new tickets:

i) we will need to deal with the histone special case for modifications: histone residues are always referred to (numbered) after the methionine has been removed in the mature form (check out hht1 lysine K14).
ii) special case 2: CTD domain residues in rpb1 and spt5 (will affect variants, but for rpb1 I see we also have modified residue CTD_S5 removed by etc.)

Existing Page Section

It will be odd to have 2 sections named "protein features", so I suggest that before release we: i) rename the 'other' protein feature section to 'protein domains' and ii) split out the protein properties into their own section (we can discuss this; we might be able to ditch it and just add charge back to the top matter).

Linking to the structure viewer: bear in mind that many of the PDB entries are fragments (e.g. dcr1), so for these we will need to use AlphaFold when mapping residues.

ValWood commented 1 year ago

This is all so good. Especially with all the corrected alleles and modifications @manulera is doing we can be confident about the displays being correct.

I think we should rename the "Variants" row as "mutants" (otherwise non-pombe people might assume natural variation).

Some suggestions for the "click through" version.

Maybe we should split rows of different (major) classes of modification
Thinking of a way to show 'conjoined' alleles (and possibly deleted regions): I noticed that some of the Pfam domains had 'connectors'; maybe this can be used?
I wonder if we can get "active sites" from anywhere. I know PDB have them, but we should be able to get active sites for many domains without structures. I think Pfam make this data (perhaps it is in the InterPro release). Otherwise we might be able to get some from UniProt.

manulera commented 1 year ago

Wow! Looks great! I would have loved it back in the day. Here are some comments:

i) we will need to deal with the histone special case for modifications: histone residues are always referred to (numbered) after the methionine has been removed in the mature form (check out hht1 lysine K14).

This should be OK if using the allele descriptions, since they are corrected for histones, even if the name is different.

Should we put the allele/variants and modifications at the top since they're important?:

I also agree with Val that the order is better with variants and modifs on top

manulera commented 1 year ago

ii) special case 2: CTD domain residues in rpb1 and spt5 (will affect variants, but for rpb1 I see we also have modified residue CTD_S5 removed by etc.)

For the CTD, let's wait a little bit until I finish with the allele pipeline, and I think we should get a solution

ValWood commented 1 year ago

This should be OK if using the allele descriptions, since they are corrected for histones, even if the name is different.

I should have been clearer. The alleles already render OK. But the modifications use actual sequence coordinates and so are 'off by one'. A simple shift in the affected genes should fix it. This is the K14 modification on hht1, which displays on residue 13:
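
A minimal sketch of the "simple shift", assuming a curated set of genes whose products are numbered without the initiator methionine. The set contents and function names are hypothetical, and which direction needs applying depends on how the coordinates are stored, which isn't settled in this thread. The only fixed fact is that mature-form position N corresponds to position N+1 in the Met-including sequence.

```python
# Genes whose products are conventionally numbered without the initiator Met.
# Hypothetical list for illustration; the real set would need curating.
MET_REMOVED_GENES = {"hht1"}


def mature_to_sequence_position(gene: str, mature_pos: int) -> int:
    """Convert mature-form numbering (Met removed) to full-sequence numbering."""
    return mature_pos + 1 if gene in MET_REMOVED_GENES else mature_pos


def sequence_to_mature_position(gene: str, seq_pos: int) -> int:
    """Inverse of mature_to_sequence_position."""
    return seq_pos - 1 if gene in MET_REMOVED_GENES else seq_pos


# Conserved histone H3 N-terminus including the initiator Met (typed in here
# as an illustration, not fetched from PomBase).
H3_N_TERM = "MARTKQTARKSTGGKAPRKQ"
k14_seq_pos = mature_to_sequence_position("hht1", 14)
```

With this, `H3_N_TERM[k14_seq_pos - 1]` is the lysine that the literature calls K14.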

manulera commented 1 year ago

@ValWood I see, but those should be amended at some point as well (I think). For now we have only fixed the ones in the HTP files, not the ones stored in Canto. We can discuss this at the next meeting, but I think it would be better to store them right in Canto and have them displayed differently in the "modification" section of the gene page than to keep storing the offset coordinates.

kimrutherford commented 1 year ago

I'm leaning towards the capitalization.

I've done that so we can see how it looks.

I also agree with Val that the order is better with variants and modifs on top

That's done now.

I think we should rename the Variants row as mutants

Done.

kimrutherford commented 1 year ago

Single sequence region mutants (i.e A123G, OR A234G,A235G OR 123-223 delta) with associated phenotypes

How should the phenotypes look? Some alleles have multiple associated phenotypes.

123-223 delta

Are those the alleles with type "partial_amino_acid_deletion" and descriptions like "ccq1(131-441)"?

manulera commented 1 year ago

How should the phenotypes look? Some alleles have multiple associated phenotypes.

I think that's going to be a bit tricky, especially for famous alleles like cdc25-22, which will have very many phenotypes. If there is an ontological way to restrict to the high-order terms, that would be best, I think (some kind of slim, but I'm not sure that's possible).
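
To make the slim idea concrete, here is a sketch that collapses annotated terms to the nearest terms in a chosen "slim" set by climbing parent links. The term names, the parent map, and the slim set are placeholders, not real FYPO terms, and whether a sensible FYPO slim exists is exactly the open question.

```python
def slim_ancestors(term, parents, slim):
    """Return the slim terms reachable from `term` by following parent links."""
    found, stack, seen = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        if current in slim:
            found.add(current)
            continue  # stop climbing once a slim term is reached
        stack.extend(parents.get(current, ()))
    return found


def summarise(terms, parents, slim):
    """Collapse a list of annotated terms to the sorted list of slim terms."""
    result = set()
    for term in terms:
        result |= slim_ancestors(term, parents, slim)
    return sorted(result)


# Placeholder ontology fragment (hypothetical term names)
PARENTS = {
    "specific_1": ["mid_1"],
    "specific_2": ["mid_1", "mid_2"],
    "mid_1": ["high_A"],
    "mid_2": ["high_B"],
}
SLIM = {"high_A", "high_B"}
```

An allele with many specific annotations would then show only a handful of high-order terms in the mouse-over.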

However, I am not sure variant -> phenotype is the most meaningful link. Probably the user would like to see which alleles give a certain phenotype, rather than what a particular sequence modification does when hovering over it. This does not seem possible/easy on the gene page. However, if we are still thinking of a separate view like the one in the PDB, in which the sequence opens in a different window and you also have the structure, then we could have a scrollable list of all phenotypes like in the gene page, where the user can pick some and only the alleles that give those phenotypes can be displayed. In the same way, it could be that when you click on a particular modification, a list with all phenotypes associated with the modification is rendered below the graph (clearly this cannot be a tooltip when mouseover).

I thought what would be really nice instead of a phenotype list is to have some ontology tree with only the FYPO terms of that gene so you can also pick high-order terms, but then I had a look at the cut phenotype tree in OLS and realised that it would look atrocious (https://www.ebi.ac.uk/ols4/ontologies/fypo/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FFYPO_0000229). That tree is for one term only! Admittedly, perhaps one of the most connected and famous phenotypes in pombe, but still...

manulera commented 1 year ago

Are those the alleles with type "partial_amino_acid_deletion" and descriptions like "ccq1(131-441)"?

Yes, in principle the descriptions of partial_amino_acid_deletion alleles always contain the missing residues, while their names may contain the missing or kept amino acids, depending on the authors. Say protein ase1 has 100 amino acids: they may name an allele missing aa 60-100 as ase1(1-59) (what's kept) or ase1delta(60-100) (what's missing). In theory, the description should contain only the missing ones, regardless of the name, but I am sure there are some errors out there.
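
The kept-versus-missing relationship in the ase1 example can be sketched as a complement over the protein length. The function name is made up, and the 100-residue length is just the illustrative figure from the comment above.

```python
def missing_from_kept(kept, protein_length):
    """Given kept residue ranges (1-based, inclusive), return the missing ranges."""
    missing = []
    next_expected = 1  # first residue not yet accounted for
    for start, end in sorted(kept):
        if start > next_expected:
            missing.append((next_expected, start - 1))
        next_expected = max(next_expected, end + 1)
    if next_expected <= protein_length:
        missing.append((next_expected, protein_length))
    return missing


# "ase1(1-59)" names what is kept; the description should hold what is missing.
ase1_missing = missing_from_kept([(1, 59)], 100)
```

A check like this could also flag alleles where the name and the description disagree.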

kimrutherford commented 1 year ago

Thinking of a way to show 'conjoined' alleles (and possibly deleted regions): I noticed that some of the Pfam domains had 'connectors'; maybe this can be used?

I tried that but unfortunately it doesn't work well in every case. It's good on the cdc10 page:

but not so friendly on the mrc1 page:

because of allele descriptions like "T121A,T327A,T513A,S572A,S599A,S604A,S614A,T634A,S637A,T645A,T653A,S938A,T965A,S1000A"
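
To put numbers on why that description is awkward to draw with connectors, this sketch splits a comma-separated substitution description into positions and reports the span and site count. It assumes every part has the simple form of one letter, a position, and one letter; the helper names are hypothetical.

```python
import re

# One-letter reference residue, 1-based position, one-letter replacement
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_substitutions(description):
    """Split 'T121A,T327A' into [(ref, position, alt), ...]."""
    parsed = []
    for part in description.split(","):
        match = SUBSTITUTION.match(part.strip())
        if match is None:
            raise ValueError(f"not a simple substitution: {part!r}")
        ref, pos, alt = match.groups()
        parsed.append((ref, int(pos), alt))
    return parsed


def connector_span(description):
    """Return (first position, last position, number of sites)."""
    positions = [pos for _, pos, _ in parse_substitutions(description)]
    return min(positions), max(positions), len(positions)


MRC1_EXAMPLE = (
    "T121A,T327A,T513A,S572A,S599A,S604A,S614A,"
    "T634A,S637A,T645A,T653A,S938A,T965A,S1000A"
)
mrc1_span = connector_span(MRC1_EXAMPLE)
```

For the mrc1 description this gives 14 sites joined across residues 121 to 1000, so a single connector would run across most of the track.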

kimrutherford commented 1 year ago

Are those the alleles with type "partial_amino_acid_deletion" and descriptions like "ccq1(131-441)"?

Yes, in principle the descriptions of partial_amino_acid_deletion alleles always contain the missing residues,

Thanks Manu.

What does it mean if a partial_amino_acid_deletion has a description like "W316*"?

Also I found one that seems a bit strange. It's a partial_amino_acid_deletion for an RNA gene?: nc-tco1-LΔ::ura4+(-395-+146)

kimrutherford commented 1 year ago

Are those the alleles with type "partial_amino_acid_deletion" and descriptions like "ccq1(131-441)"?

I tried adding those alleles and it didn't work out too well in some cases. Below is the diagram for mrc1.

Gene page: https://desktop.kmr.nz/gene/SPAC694.06c Full diagram: https://desktop.kmr.nz/protein_feature_view/widget/SPAC694.06c It will be at https://dev.pombase.kmr.nz/gene/SPAC694.06c on Monday morning.

kimrutherford commented 1 year ago

What does it mean if a partial_amino_acid_deletion has a description like "W316*"?

The nomenclature paper says that the * is a stop codon, which makes sense. But in that case why doesn't W316* have the type "amino_acid_mutation"? Is it because the protein gets truncated?

manulera commented 1 year ago

The nomenclature paper says that the * is a stop codon, which makes sense. But in that case why doesn't W316* have the type "amino_acid_mutation"? Is it because the protein gets truncated?

Yes, this is a truncated protein. In the past we used to record them as "nonsense_mutation" but we decided to merge them with partial amino acid deletion, since at the product level there is no difference in principle.

So, for the sake of the feature viewer, W316* is equivalent to 316-XXX where XXX is the protein length.
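
That equivalence could be handled in one place when turning descriptions into drawable ranges. This is only a sketch: it assumes a description is either a single "start-end" range or a single nonsense change such as "W316*", and the protein lengths in the usage lines are invented.

```python
import re

RANGE = re.compile(r"^(\d+)-(\d+)$")        # e.g. "131-441"
NONSENSE = re.compile(r"^[A-Z](\d+)\*$")    # e.g. "W316*"


def deleted_range(description, protein_length):
    """Return the (start, end) of missing residues, 1-based and inclusive."""
    match = RANGE.match(description)
    if match:
        return int(match.group(1)), int(match.group(2))
    match = NONSENSE.match(description)
    if match:
        # A stop codon at position N removes everything from N to the end.
        return int(match.group(1)), protein_length
    raise ValueError(f"unrecognised description: {description!r}")


# Hypothetical protein lengths, for illustration only
range_example = deleted_range("131-441", 735)
stop_example = deleted_range("W316*", 500)
```

Descriptions with several ranges, or anything else unexpected, raise an error here so they can be reviewed rather than drawn wrongly.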

kimrutherford commented 1 year ago

Are those the alleles with type "partial_amino_acid_deletion" and descriptions like "ccq1(131-441)"?

I tried adding those alleles and it didn't work out too well in some cases

On some pages it looks very nice though. This is sre1 on the dev server as of this morning: http://dev.pombase.kmr.nz/gene/SPBC19C2.09

manulera commented 1 year ago

Looks great!

I noticed that in the truncation section the "chunks" that are on the same lane do not necessarily belong to the same construct. For example:

Screenshot 2023-07-10 at 09 39 49

Screenshot 2023-07-10 at 09 42 03

All chunks may get distributed to minimise the rows or something, perhaps that's why there are many rows in that example where ends meet perfectly. If there is a way to link two fragments to indicate they belong to the same construct and force them to appear in the same lane, that would be the best (with a thin line in the middle like the disulfide bonds in the PDB display?) not sure if that's possible / documented.
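
If the viewer library doesn't offer this directly, the lane assignment could be computed before the data is handed over. The sketch below is a greedy packing that treats each construct as one unit: it reserves the whole extent from the first to the last fragment, so a connecting line between fragments never crosses another construct. The names and coordinates are invented, and this is not a description of how the library lays out rows.

```python
def extent(fragments):
    """Full span covered by a construct, including gaps between fragments."""
    return min(s for s, _ in fragments), max(e for _, e in fragments)


def overlaps(a, b):
    """True if two 1-based inclusive ranges share at least one residue."""
    return a[0] <= b[1] and b[0] <= a[1]


def assign_lanes(constructs):
    """Place each construct in the first lane where its extent fits.

    `constructs` maps a construct name to a list of (start, end) fragments.
    Returns {construct name: lane index}.
    """
    lanes = []  # each lane holds the extents already placed in it
    placement = {}
    for name, fragments in constructs.items():
        span = extent(fragments)
        for index, lane in enumerate(lanes):
            if not any(overlaps(span, other) for other in lane):
                lane.append(span)
                placement[name] = index
                break
        else:
            lanes.append([span])
            placement[name] = len(lanes) - 1
    return placement


# Hypothetical constructs
lanes = assign_lanes({
    "allele_a": [(1, 100), (300, 400)],  # two fragments, one construct
    "allele_b": [(101, 299)],            # would fit the gap, but is kept apart
    "allele_c": [(450, 500)],
})
```

Here allele_b goes to its own lane even though it fits between the two fragments of allele_a, which avoids the "ends meet perfectly" rows that look like a single construct.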

ValWood commented 1 year ago

However, I am not sure variant -> phenotype is the most meaningful link. Probably the user would like to see which alleles give a certain phenotype, rather than what a particular sequence modification does when hovering over it. This does not seem possible/easy on the gene page. However, if we are still thinking of a separate view like the one in the PDB, in which the sequence opens in a different window and you also have the structure, then we could have a scrollable list of all phenotypes like in the gene page, where the user can pick some and only the alleles that give those phenotypes can be displayed. In the same way, it could be that when you click on a particular modification, a list with all phenotypes associated with the modification is rendered below the graph (clearly this cannot be a tooltip when mouseover).

This would be very useful. I wonder if it could operate like the filters so you would check specific phenotypes and modifications and the display would reduce to show only those.
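
A filter of that kind might boil down to something like the following, where ticking no boxes shows everything and ticking some keeps only alleles annotated with at least one selected phenotype. The allele and phenotype names are placeholders, and the data shape is an assumption rather than what the site's API returns.

```python
def filter_alleles(allele_phenotypes, selected):
    """Keep alleles annotated with at least one of the selected phenotypes."""
    selected = set(selected)
    if not selected:
        return dict(allele_phenotypes)  # nothing ticked: show everything
    return {
        allele: phenotypes
        for allele, phenotypes in allele_phenotypes.items()
        if selected & set(phenotypes)
    }


# Placeholder annotations (hypothetical alleles and phenotype labels)
ANNOTATIONS = {
    "allele_1": ["phenotype_x", "phenotype_y"],
    "allele_2": ["phenotype_y"],
    "allele_3": ["phenotype_z"],
}
shown = filter_alleles(ANNOTATIONS, ["phenotype_y"])
```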

ValWood commented 1 year ago

I suggest we open new tickets for each outstanding task, or feature request, as this ticket is in danger of becoming difficult to navigate.

ValWood commented 1 year ago

Hi @kim, the stop codon change was documented in the news item: Curation update - “nonsense mutation” merged into “partial amino acid deletion”. I'll close off those comments, but let us know if it doesn't make sense.

kimrutherford commented 1 year ago

All chunks may get distributed to minimise the rows or something, perhaps that's why there are many rows in that example where ends meet perfectly. If there is a way to link two fragments to indicate they belong to the same construct and force them to appear in the same lane, that would be the best

Here's what that would look like: https://desktop.kmr.nz/protein_feature_view/widget/SPAC694.06c

(Note that the zooming and scrolling don't work correctly on that page. I've created an issue about that.)

manulera commented 1 year ago

Here's what that would look like:

Screenshot 2023-07-10 at 10 40 35

Yes! That's perfect! That was fast!

ValWood commented 1 year ago

Tasks in new tickets

Protein feature viewer: decide text for mouse overs pombase/website#2068

Protein feature viewer: "partial_amino_acid_deletion" pombase/website#2067

Protein sequence feature viewer show 'conjoined' alleles (and possibly deleted regions) I noticed that some of the Pfam domains had 'connectors" maybe this can be used? pombase/website#2066

Protein sequence feature viewer: Find external active site sources pombase/pombase-chado#1126

protein sequence feature viewer: linking to structures pombase/website#2064

Protein sequence feature viewer: existing page section pombase/website#2063

Protein sequence feature viewer: enable phenotype-> residue in full view pombase/website#2062

ValWood commented 1 year ago

The only thing left in this ticket is

Also I found one that seems a bit strange. It's a partial_amino_acid_deletion for an RNA gene?: nc-tco1-LΔ::ura4+(-395-+146)

I will fix that one.

pombase / website

Protein sequence feature viewer (for domains mutants and modifications) #2053