Add protein in blood serum traits

rays22 commented 1 year ago

This commit intends to add 4740 "level of XXX-PROTEIN in blood serum" trait terms. If applied, this commit will fix #222 and address https://github.com/EBISPOT/efo/issues/1848

rays22 commented 1 year ago

I need help with updating pr_slim with new terms.

rays22 commented 1 year ago

Please note that these UniProt/PR terms are all human-specific, which is fine for human traits but may not be optimal for cross-species traits. For cross-species protein level traits, should we not use more generic (species neutral) PR classes to avoid term proliferation and conceptual drift?

matentzn commented 1 year ago

Thank you @rays22!

Here are the docs for updating the PRO slim: https://obophenotype.github.io/bio-attribute-ontology/editors-guide/#refresh-pro-slim

Feel free to improve them as well, and reach out if there are issues.

For cross-species protein level traits, should we not use more generic (species neutral) PR classes to avoid term proliferation and conceptual drift?

HMMM good point. This is a difficult call to make, as it has two overarching issues:

We have the overhead of having to map all OBA term requests to the species-neutral variant every time
Explain to requestors (e.g. GWAS) how they can go from their PRO data to the OBA classes (not impossible, but overhead)
Maybe having to deal with the fact that our requestors actually wanted a specific human protein.

My intuition says: let's keep it simple and add as is, and if we have to deal with cross-species, we apply a merge workflow on select classes. What do you think?

As an aside:

In what ways can human and animal proteins differ from each other?

jamesamcl commented 1 year ago

For cross-species protein level traits, should we not use more generic (species neutral) PR classes to avoid term proliferation and conceptual drift?

HMMM good point. This is a difficult call to make, as it has two overarching issues:

We have the overhead of having to map all OBA term requests to the species-neutral variant every time

Explain to requestors (e.g. GWAS) how they can go from their PRO data to the OBA classes (not impossible, but overhead)

Maybe having to deal with the fact that our requestors actually wanted a specific human protein.

My intuition says: let's keep it simple and add as is, and if we have to deal with cross-species, we apply a merge workflow on select classes. What do you think?

As an aside:

In what ways can human and animal proteins differ from each other?

They are mostly simply different proteins (i.e. different sequences of amino acids). They can be grouped by having similar structure and/or function, but this grouping is not some kind of objective truth, and the groupings are always subject to change, same as gene orthologs.

If the paper observed some specific human protein we need to record that; it would be incredibly reductive/possibly incorrect to record a protein family instead. The whole point of the ontology is to provide the hierarchy to align based on superclasses.

I think providing the organism specific terms but inheriting the hierarchy from PRO (to define the generic families) would be useful though?

matentzn commented 1 year ago

They are mostly simply different proteins (i.e. different sequences of amino acids). They can be grouped by having similar structure and/or function, but this grouping is not some kind of objective truth, and the groupings are always subject to change, same as gene orthologs.

Thanks for the clarification, in this case, yes, definitely stick with the Human proteins for now!

I think providing the organism specific terms but inheriting the hierarchy from PRO (to define the generic families) would be useful though?

The PRO hierarchy is automatically inherited if both the grouping and specific protein are present in the trait, but we don't want to proactively add all possible parent proteins to OBA. The hierarchy will emerge as we add new terms based on specific use cases.

If we see that it would be generally useful to add all parents, we should talk about this separately, but I am worried that this would mean that "adding 5000 new classes" would turn into "adding 10000 new classes".

rays22 commented 1 year ago

F.Y.I: The species-neutral PRO categories (which I am not using as per your suggestions) are all defined as Protein X product of the human X gene or a 1:1 ortholog thereof.

jamesamcl commented 1 year ago

F.Y.I: The species-neutral PRO categories (which I am not using as per your suggestions) are all defined as Protein X product of the human X gene or a 1:1 ortholog thereof.

Thanks, it's reassuring that they are only for 1:1 orthologs, but I still think we should make it clear which specific protein was measured, otherwise you would have to infer it from the study? E.g. looking at the first one in your list:

https://www.ebi.ac.uk/ols4/ontologies/pr/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPR_000030806?lang=en

This is a category, so it does not have a corresponding UniProt record. I would have to use one of the subclasses.

But the OBA term alone (being species independent) would not make it clear which one I should look at.

Also I am not sure that the recorded orthology will not change over time, especially for less well-understood proteins/organisms?

rays22 commented 1 year ago

I would like to explain my concerns in more detail, because I am not sure if I have been clear in my explanations earlier. I want to make sure that we make well informed decisions as a team.

The advantage of being precisely vague when defining ontological categories is that it allows annotation at varying levels of precision. Let's take for example
- PR:000030806 succinate dehydrogenase assembly factor 1, mitochondrial and
- PR:A6NFY7 succinate dehydrogenase assembly factor 1, mitochondrial (human)

You can use PR:000030806 succinate dehydrogenase assembly factor 1, mitochondrial to annotate both human and mouse traits. It would be wrong to use PR:A6NFY7 succinate dehydrogenase assembly factor 1, mitochondrial (human) to annotate mouse traits (unless it is a transgenic mouse trait). If you define your trait categories in a species-specific way, then you need to create shadow hierarchies that follow the taxon hierarchies. I thought OBA was intended to be a cross-species ontology to help integrate trait and phenotype data, for example from mouse and human databases. Please note that mouse- and human-specific traits are already disjoint regardless of any hypothetical future change in how PRO records orthologs.
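To make the distinction concrete, here is a minimal sketch, not OBA tooling. The lookup table and the `can_annotate` function are hypothetical; only the SDHAF1 term pair is taken from this thread, and the transgenic exception is deliberately ignored.

```python
# Hypothetical lookup: species-specific PR term -> (taxon, species-neutral PR class).
# Only the SDHAF1 pair comes from this thread; the structure is an assumption.
SPECIES_SPECIFIC = {
    "PR:A6NFY7": ("human", "PR:000030806"),
}


def can_annotate(pr_term: str, trait_species: str) -> bool:
    """Return True if pr_term may describe a non-transgenic trait in trait_species."""
    if pr_term in SPECIES_SPECIFIC:
        taxon, _generic = SPECIES_SPECIFIC[pr_term]
        return taxon == trait_species
    # In this sketch, any term not listed above is treated as species-neutral.
    return True


print(can_annotate("PR:000030806", "mouse"))  # species-neutral class
print(can_annotate("PR:A6NFY7", "mouse"))     # human-specific term on a mouse trait
```

Under these assumptions the generic class is accepted for both species, while the human-specific term is rejected for mouse traits.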

PR:000030806 succinate dehydrogenase assembly factor 1, mitochondrial is defined as A protein that is a translation product of the human SDHAF1 gene or a 1:1 ortholog thereof. According to PRO, the Gene Template of A6NFY7 is HGNC:33867 SDHAF1. I expect that the product of the same gene will always remain an orthologue of itself. In case PR:000030806 gets obsoleted in PRO, then the same applies to PR:A6NFY7. There is no disadvantage using PR:000030806 compared to PR:A6NFY7 from the perspective of future changes in PRO orthologue groupings.
The UniProt IDs were provided by the GWAS Catalog at our request to help us disambiguate the plain text protein names.

Explain to requestors (e.g. GWAS) how they can go from their PRO data to the OBA classes (not impossible, but overhead) Maybe having to deal with the fact that our requestors actually wanted a specific human protein.

I do not think that the GWAS curators would demand an explanation, but I would be happy to provide one and listen to any justifications or arguments of why they would prefer one PRO term over the other. Please, remember that they have been using EFO protein level measurement terms based on non-standard plain text protein names without any complaints. Using PR:000030806 succinate dehydrogenase assembly factor 1, mitochondrial would be already a step forward in precision, granularity, and FAIRness. The OBA trait category using PR:000030806 succinate dehydrogenase assembly factor 1, mitochondrial as the component term would provide the necessary precision for the GWAS Catalog. I could be convinced otherwise, but please provide some evidence or explanation.

For pragmatic reasons, we can add the ~4.7k human protein level trait categories to OBA, so that next time it is easier to look up these traits based on the UniProt IDs for GWAS annotation. @udp , please, keep in mind that OBA is meant to serve several species-specific databases. If we want them to use OBA protein level trait terms (some of which they already do), then we will have to add the shadow terms to the human-specific ones.

I hope this helps.

jamesamcl commented 1 year ago

Thanks Ray; I am about 80% convinced that we should use the superclasses now.

You can use PR:000030806 succinate dehydrogenase assembly factor 1, mitochondrial to annotate both human and mouse traits. It would be wrong to use PR:A6NFY7 succinate dehydrogenase assembly factor 1, mitochondrial (human) to annotate mouse traits (unless it is a transgenic mouse trait).

I don't think we should ignore the possibility of annotating transgenic studies or other cases when you want to measure levels of a different organism's protein e.g. when raising antibodies; but I suppose that it will always be possible to add more specific terms for these rare cases as needed.

If you define your trait categories in a species-specific way, then you need to create shadow hierarchies that follow the taxon hierarchies. I thought OBA was intended to be a cross-species ontology to help integrate trait and phenotype data, for example from mouse and human databases.

This is my understanding too, but I am just wary of doing the integration prematurely/discarding more specific information present in the paper.

... is defined as A protein that is a translation product of the human SDHAF1 gene or a 1:1 ortholog thereof. According to PRO, the Gene Template of A6NFY7 is HGNC:33867 SDHAF1. I expect that the product of the same gene will always remain an orthologue of itself.

Yes, I'm not worried about this, I'm just worried about possible situations where:

Experiment B measures protein P in human
Experiment A measures protein P' in, say, drosophila
At the time P' is thought to be the (gene product) ortholog of P, so we record both experimental results using the same identifier
Later it is found that P'' is more likely to be the ortholog of P and the hierarchy in UniProt/PR changes

I have no idea how often this kind of thing happens in practice and I'm sure it's unlikely to happen often for well-studied proteins in model organisms like human/mouse, but if we are establishing a best practice for OBA I think it is worth at least evaluating if it's likely to be a problem.

In case PR:000030806 gets obsoleted in PRO, then the same applies to PR:A6NFY7. There is no disadvantage using PR:000030806 compared to PR:A6NFY7 from the perspective of future changes in PRO orthologue groupings.

Yes I'm not too worried about the obsoletion either. More concerned about being able to easily find a corresponding UniProt record for what was measured in the study, because we would be mapping to a category instead so end users of the data would have to drill down the hierarchy to find the mapping back to a specific protein.

The UniProt IDs were provided by the GWAS Catalog at our request to help us disambiguate the plain text protein names.

I forgot about this, and the fact that they were plain text protein names strongly supports your argument for using the generic versions (because we wouldn't be losing any data from the paper).

For pragmatic reasons, we can add the ~4.7k human protein level trait categories to OBA, so that next time it is easier to look up these traits based on the UniProt IDs for GWAS annotation. @udp , please, keep in mind that OBA is meant to serve several species-specific databases. If we want them to use OBA protein level trait terms (some of which they already do), then we will have to add the shadow terms to the human-specific ones.

I will chat with the GWAS catalog and see if they have any input.

rays22 commented 1 year ago

I am also less certain than 99.3% that what I propose is the right choice. :) I think I can match 99.3% of the UniProt IDs to generic (human + '1:1 ortholog thereof') PR categories (this table) from the UniProt list in this PR. It seems that there will always be human-specific proteins (with no clear orthologue grouping in PR) from this kind of proteomics study. There are also some PR:UniProt entities in the list that are two steps away from the generic class (i.e. protein X isoform (human) --> protein X (human) --> protein X), which are more difficult to match. Here is a table of the 31 human-specific protein examples:

#ID LABEL
PR:A1A4F0   putative uncharacterized protein SLC66A1L (human)
PR:A1L168   uncharacterized protein C20orf202 (human)
PR:A1L170   uncharacterized protein C1orf226 (human)
PR:E0CX11   short transmembrane mitochondrial protein 1 (human)
PR:K9M1U5   interferon lambda-4 (human)
PR:O60449   lymphocyte antigen 75 isoform 4 and LY75-CD302 fusion isoforms V34-2/V33-2 (human)
PR:P01602   immunoglobulin kappa variable 1-5 (human)
PR:P0CG32   zinc finger CCHC domain-containing protein 18 (human)
PR:P0DP42   transmembrane protein 225B (human)
PR:P57076   cilia- and flagella-associated protein 298 (human)
PR:P59665   neutrophil defensin 1 (human)
PR:P59666   neutrophil defensin 3 (human)
PR:Q01523   defensin alpha 5 (human)
PR:Q15053   uncharacterized protein KIAA0040 (human)
PR:Q56UQ5   TPT1-like protein (human)
PR:Q6MZM9   proline-rich protein 27 (human)
PR:Q6PL45   BRICHOS domain-containing protein 5 (human)
PR:Q6UXQ4   uncharacterized protein C2orf66 (human)
PR:Q6ZUB0   spermatogenesis-associated protein 31D4 (human)
PR:Q6ZVL6   UPF0606 protein KIAA1549L (human)
PR:Q6ZWK4   regulator of hemoglobinization and erythroid cell expansion protein (human)
PR:Q8IXM2   chromatin complexes subunit BAP18 (human)
PR:Q8N812   uncharacterized protein C12orf76 (human)
PR:Q8NEA5   uncharacterized protein C19orf18 (human)
PR:Q8TE69   protein EOLA1 (human)
PR:Q8TEF2   uncharacterized protein C10orf105 (human)
PR:Q8WUE5   cancer/testis antigen 55 (human)
PR:Q8WWF3   serine-rich single-pass membrane protein 1 (human)
PR:Q8WYQ4   uncharacterized protein C22orf15 (human)
PR:Q96HG1   small integral membrane protein 10 (human)
PR:Q96LM9   uncharacterized protein C20orf173 (human)

I would add them as they are (PR: UniProt).
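The matching described above (walking up from a PR:UniProt term, at most two steps, to a generic class) could be sketched roughly as follows. This is an illustration only, not the script used for this PR: the `PARENT` map, the `GENERIC` set, and the isoform ID `PR:A6NFY7-1` are hypothetical stand-ins for data that would come from PRO.

```python
from typing import Optional

# Hypothetical is_a parent map. Only PR:A6NFY7 -> PR:000030806 is taken from
# this thread; "PR:A6NFY7-1" is an invented isoform ID used for illustration.
PARENT = {
    "PR:A6NFY7-1": "PR:A6NFY7",
    "PR:A6NFY7": "PR:000030806",
}
# Hypothetical set of species-neutral ("or a 1:1 ortholog thereof") classes.
GENERIC = {"PR:000030806"}


def find_generic(term: str, max_steps: int = 2) -> Optional[str]:
    """Walk up at most max_steps parents; return the first generic class found."""
    current = term
    for _ in range(max_steps):
        parent = PARENT.get(current)
        if parent is None:
            return None  # no parent recorded: treat the term as human-specific only
        if parent in GENERIC:
            return parent
        current = parent
    return None


for term in ("PR:A6NFY7", "PR:A6NFY7-1", "PR:K9M1U5"):
    print(term, "->", find_generic(term))
```

Terms for which the function returns `None`, such as the entries in the table above, would be added as they are.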

rays22 commented 1 year ago

I am re-doing these terms in https://github.com/obophenotype/bio-attribute-ontology/pull/253 as per discussions at the Monarch/SPOT coordination meeting today.

cmungall commented 11 months ago

Only just saw this.

I would have said use the species specific proteins with IDs from uniprot

clear SOP
uniprot identifiers are well understood, no need for an additional layer of mapping and identifier juggling
most of the PRO species neutral forms are pseudogeneralizations / trivial homology shadows of the human uniprot
what is the SOP for adding proteins for species PRO does not cover?
what is the SOP for adding a term from mouse, zfin, whatever, where there is no 1:1? Request a pseudogeneralization, wait til next PRO release, etc. Seems complicated
in general it's better to be specific
ontology generalization is fine for things like finger1->finger->digit, but we should be more careful with homology, which should not be baked in
if people really want cross-species generalizations then this can be done at query time swapping in a homology resource of choice (panther, diopt, ...), rather than baking in
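A rough sketch of what query-time generalization might look like, assuming a homology resource can be reduced to a mapping from protein ID to homology group. The function `expand_query`, the group names, and the mouse placeholder IDs are all hypothetical; no real PANTHER or DIOPT API is used here.

```python
from typing import Dict, Set

# A homology resource modelled as: protein ID -> homology group ID (an assumption).
Homology = Dict[str, str]


def expand_query(protein: str, homology: Homology) -> Set[str]:
    """Return the protein plus every protein sharing its homology group."""
    group = homology.get(protein)
    if group is None:
        return {protein}  # resource has no opinion: keep the specific protein
    return {p for p, g in homology.items() if g == group}


# Two made-up resources that disagree; the mouse IDs are placeholders, not real accessions.
resource_a = {"UniProt:A6NFY7": "g1", "UniProt:MOUSE_PLACEHOLDER_1": "g1"}
resource_b = {"UniProt:A6NFY7": "g7", "UniProt:MOUSE_PLACEHOLDER_2": "g7"}

print(sorted(expand_query("UniProt:A6NFY7", resource_a)))
print(sorted(expand_query("UniProt:A6NFY7", resource_b)))
```

The annotation itself stays on the specific UniProt-based term; only the query result changes when a different homology resource is swapped in.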

obophenotype / bio-attribute-ontology

Add protein in blood serum traits #251