Improve display of histone modifications

ValWood commented 11 months ago

Histone alleles and modifications are universally described using the processed form (minus the initiator methionine) of the histones.

For alleles we do this:

hht3-K56R(K57R aa)

The name follows the community usage, and the description matches the underlying protein sequence. We use the underlying protein sequence to check that the alleles and modifications conform with the sequence in our QC pipeline without adding a special case for histones.
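As a rough illustration of why the description has to match the underlying sequence, here is a minimal sketch of a residue check of the kind a QC pipeline might run. The function name `residue_matches` and the toy sequence are assumptions made for this example; this is not the actual PomBase QC code.

```python
import re

# Illustrative N-terminal fragment that includes the initiator methionine
# at position 1. A toy sequence for this example, not data pulled from PomBase.
TOY_SEQUENCE = "MARTKQTARKSTGGKAPRKQLA"

# One-letter amino acid code followed by a 1-based position, e.g. "K5".
RESIDUE_RE = re.compile(r"^([A-Z])(\d+)$")


def residue_matches(sequence: str, residue: str) -> bool:
    """Return True if `residue` names the amino acid found at that
    1-based position of `sequence` (hypothetical helper)."""
    match = RESIDUE_RE.match(residue)
    if match is None:
        raise ValueError(f"unrecognised residue: {residue!r}")
    amino_acid, position = match.group(1), int(match.group(2))
    if position < 1 or position > len(sequence):
        return False
    return sequence[position - 1] == amino_acid


# Preprocessed numbering (methionine counted) passes the generic check.
print(residue_matches(TOY_SEQUENCE, "K5"))  # True
# The community (processed) numbering for the same lysine does not.
print(residue_matches(TOY_SEQUENCE, "K4"))  # False
```

With the description in preprocessed coordinates, the same check works for every protein, which is the point of avoiding a histone special case.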

At the moment we display modifications using only the preprocessed form, which will be very confusing for end users. We would like to change the display so that the universal nomenclature (processed form) is presented first, with the preprocessed form in parentheses:

K14(K15) processed/preprocessed.
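The requested display could be derived from the stored (preprocessed) residue along these lines. This is only a sketch: `display_residue` and the `is_histone` flag are hypothetical names, and it assumes the only processing step is removal of the initiator methionine, so the two numberings differ by exactly 1.

```python
import re

# One-letter amino acid code followed by a 1-based position, e.g. "K15".
RESIDUE_RE = re.compile(r"^([A-Z])(\d+)$")


def display_residue(stored: str, is_histone: bool) -> str:
    """Format a stored (preprocessed) residue for display.

    For histones, show the processed numbering first and the stored
    preprocessed numbering in parentheses. Everything else is unchanged.
    """
    match = RESIDUE_RE.match(stored)
    if not is_histone or match is None:
        return stored
    amino_acid, position = match.group(1), int(match.group(2))
    if position < 2:
        # Position 1 is the removed methionine; there is nothing to convert.
        return stored
    return f"{amino_acid}{position - 1}({stored})"


print(display_residue("K15", is_histone=True))   # K14(K15)
print(display_residue("S72", is_histone=False))  # S72
```

Keeping the conversion in the display layer means the stored data and the QC check stay in one coordinate system.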

I will add a list of the histones that this applies to.

ValWood commented 11 months ago

Applies to

| Systematic ID | Gene name | Product description |
| --- | --- | --- |
| SPCC622.08c | hta1 | histone H2A alpha |
| SPAC19G12.06c | hta2 | histone H2A beta |
| SPCC622.09 | htb1 | histone H2B Htb1 |
| SPAC1834.04 | hht1 | histone H3 h3.1 |
| SPBC8D2.04 | hht2 | histone H3 h3.2 |
| SPAC1834.03c | hhf1 | histone H4 h4.1 |
| SPBC8D2.03c | hhf2 | histone H4 h4.2 |
| SPBC1105.12 | hhf3 | histone H4 h4.3 |
| SPBC1105.17 | cnp1 | centromere-specific histone H3 CENP-A |
| SPBC11B10.10c | pht1 | histone H2A variant H2A.Z Pht1 |
| SPBC1105.11c | hht3 | histone H3 h3.3 |

kimrutherford commented 11 months ago

Applies to

Do these have something like a GO term in common so we don't need to maintain a gene list in the configuration file?

ValWood commented 11 months ago

extended the "applies to" list with Jo's additions

ValWood commented 11 months ago

Do these have something like a GO term in common so we don't need to maintain a gene list in the configuration file?

https://www.pombase.org/results/from/id/593d5a9e-c876-467e-a531-336791ef7e8b

Unfortunately not. These are described variously as adapters or structural molecules. There is not even a protein family that is specific for the group, because this is only a subset of all histone folds.

kimrutherford commented 11 months ago

Could we add some sort of /controlled_curation to annotate these genes? That way the annotation will end up in Chado and the web code will be able to identify the genes.

ValWood commented 11 months ago

Yes, I can add /controlled_curation="histone"

or does it also need to have a "type"?

kimrutherford commented 11 months ago

or does it also need to have a "type"?

It's been so long, I can't remember. I'll refresh my memory and let you know.

But now I think about it, I wonder if /SO=SO:0000418 would be better?

ValWood commented 11 months ago

This is for signal peptide, not modified histone? But PRO might have a grouping, will check.

ValWood commented 11 months ago

nope PRO does not have a grouping (it would be a bit odd anyway)

kimrutherford commented 11 months ago

This is for signal peptide, not modified histone?

Sorry! Please ignore me. :-) I was thinking about this at the same time: pombase/website#2115

ValWood commented 11 months ago

I thought you were. They are both removal of N-terminal regions, but a bit different ;)

ValWood commented 11 months ago

I could find, or request the PRO_ID for each one and add controlled_curation=display(PRO:xxx)

kimrutherford commented 11 months ago

/controlled_curation="histone"

I think something like this:

/controlled_curation="term=warning, histone"

K14(K15) processed/preprocessed.

Where does that need to be shown on the web site? Sorry, I don't understand histones.

I could find, or request the PRO_ID for each one and add controlled_curation=display(PRO:xxx)

I'm not sure about that but I don't really understand what's needed.

kimrutherford commented 11 months ago

Action for kmr: make a Chado check that we always have 11 histones - 11 annotations to GO:0000786 nucleosome
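A minimal sketch of that sanity check, written against an in-memory list of annotations rather than Chado itself. The function name, the `(gene_id, term_id)` input shape and the error message are assumptions for illustration; the real check would query the database.

```python
EXPECTED_HISTONE_COUNT = 11
NUCLEOSOME_TERM = "GO:0000786"


def check_histone_count(annotations):
    """Return the sorted gene IDs annotated to the nucleosome term.

    `annotations` is an iterable of (gene_id, term_id) pairs, a
    hypothetical stand-in for the result of a Chado query.
    Raises ValueError if the number of distinct genes is not 11.
    """
    genes = {gene for gene, term in annotations if term == NUCLEOSOME_TERM}
    if len(genes) != EXPECTED_HISTONE_COUNT:
        raise ValueError(
            f"expected {EXPECTED_HISTONE_COUNT} genes annotated to "
            f"{NUCLEOSOME_TERM}, found {len(genes)}"
        )
    return sorted(genes)


# Example input built from the "Applies to" list in this ticket.
HISTONE_IDS = [
    "SPCC622.08c", "SPAC19G12.06c", "SPCC622.09", "SPAC1834.04",
    "SPBC8D2.04", "SPAC1834.03c", "SPBC8D2.03c", "SPBC1105.12",
    "SPBC1105.17", "SPBC11B10.10c", "SPBC1105.11c",
]
example = [(gene_id, NUCLEOSOME_TERM) for gene_id in HISTONE_IDS]
print(len(check_histone_count(example)))  # 11
```

Counting distinct genes rather than rows keeps the check from passing by accident when one gene has duplicate annotations.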

kimrutherford commented 11 months ago

How's this?:

https://desktop.kmr.nz/gene/SPAC1834.04

kimrutherford commented 11 months ago

Good, but I think it needs to be visible in the full view as well as show details.

It will be. That's what the two images show. The code isn't finished yet so my desktop version is a bit dodgy. I took the screenshot while it was briefly doing the right thing for hht1. :-)

kimrutherford commented 11 months ago

Hi Val.

On the documentation page for the modifications section (https://www.pombase.org/documentation/gene-page-modifications) it says:

Note: for histones, residue numbering assumes that the initiator methionine is removed.

Is that correct? If so, I think I'm misunderstanding things.

ValWood commented 11 months ago

I think that is because the documentation predates Manu's script, and previously the instructions were to represent histone modifications using the histone code standard. I'm not sure whether everybody adhered to that guideline, because presumably they would need to manually edit their mass spec output to make the histones match (unless the mass spec processing does this automatically, which seems unlikely because even UniProt reports the modifications on the unprocessed version: https://www.uniprot.org/uniprotkb/P09988/entry#ptm_processing).

It might be useful to know how many Manu's script 'fixed' to the non-modified form. @manu, do you know this?

But I think we probably just need to update the modification documentation now to say that histones should be reported in the unprocessed form. Basically, we changed the way we do this to make it the same for every protein, but weren't aware of the existing instructions for histones.

ValWood commented 11 months ago

We will also need to correct

https://www.pombase.org/

The Residue column indicates the position modified. For protein modifications, use one-letter amino acid code. Multiple entries are allowed, but only for cases where two or more of the same modification are known to be present at the same time. Separate entries with commas (e.g. S72,T85). Position numbering should reflect the current sequence data in PomBase. Please refer to the Gene Coordinate Changes page to ensure that your residue position entries are up to date. Also note that histones are conventionally numbered assuming the initiator methionine is removed (i.e. every position in the mature protein is numbered, and is 1 less than the apparent numbering predicted by translating the ORF).

ValWood commented 11 months ago

For protein modifications, use one-letter amino acid code. Multiple entries are allowed, but only for cases where two or more of the same modification are known to be present at the same time. Separate entries with commas (e.g. S72,T85)

This section isn't correct. We don't conjoin residues, because this would make modifications impossible to collapse, and it isn't super informative since it's biased towards close residues which are likely to be on the same peptide fragment. Also, we report the phase when known, so we can get modifications that co-occur this way.

We can tell people this isn't necessary, and why: "Please refer to the Gene Coordinate Changes page to ensure that your residue position entries are up to date."

I can rewrite a shorter version tomorrow.

kimrutherford commented 11 months ago

The residues are displayed correctly for histones now. I had some bugs to fix when displaying on pages that have a mixture of histones and non-histones like: https://www.pombase.org/term/MOD:00408

That's now fixed but let me know if you see any problems.

We talked about being more explicit in the detailed display (e.g. "K56(K57) processed(preprocessed)"). Should we do that? Here's how it looks in the test version on my desktop:

kimrutherford commented 11 months ago

For now I've removed "Note: for histones, residue numbering assumes that the initiator methionine is removed." from the docs while we work on it.

ValWood commented 11 months ago

We talked about being more explicit in the detailed display (e.g. "K56(K57) processed(preprocessed)"). Should we do that? Here's how it looks in the test version on my desktop:

It doesn't hurt, but I think most biologists will understand without. Put it in for now as there is space...

kimrutherford commented 11 months ago

OK, I've added that. The main site will have the change in a little while.

ValWood commented 11 months ago

Add text at the top in the "Notes" section.

We have a QC pipeline in place to check that the modified residues match the current protein sequence coordinates. For most proteins where the protein sequence coordinates have changed, we will be able to automatically "lift over" to the current sequence residue numbering.

Change text

"Also note that histones are conventionally numbered assuming the initiator methionine is removed (i.e. every position in the mature protein is numbered, and is 1 less than the apparent numbering predicted by translating the ORF)."

to

"Histones should be represented using the unprocessed protein sequence coordinates, not the processed coordinates conventionally used to describe histones. Histone modifications will be represented on the gene pages as K4(K5) processed(preprocessed), but our checking pipeline will expect unmodified forms."

ValWood commented 11 months ago

Edited text above. @kimrutherford can you make this change and then this ticket can close

kimrutherford commented 11 months ago

can you make this change and then this ticket can close

Excellent, thanks. I've made that text change.

pombase / website

Improve display of histone modifications #2116