obophenotype / bio-attribute-ontology

source files for OBA (Ontology of Biological Attributes)
https://obophenotype.github.io/bio-attribute-ontology
Creative Commons Zero v1.0 Universal

Reconsidering the use of PRO homology classes #269

Open matentzn opened 11 months ago

matentzn commented 11 months ago

From @cmungall : Only just saw this.

I would have said use the species specific proteins with IDs from uniprot

Originally posted by @cmungall in https://github.com/obophenotype/bio-attribute-ontology/issues/251#issuecomment-1638899053

rays22 commented 4 months ago
  1. clear SOP

Yes, I agree. I will work on it if time allows.

  1. uniprot identifiers are well understood, no need for an additional layer of mapping and identifier juggling

Correct me if I am wrong, but UniProt accessions refer to specific amino-acid sequences of polypeptides encoded by genes. For example, here is the basic human cardiac troponin I UniProt entry:

sp|P19429|TNNI3_HUMAN Troponin I, cardiac muscle OS=Homo sapiens OX=9606 GN=TNNI3 PE=1 SV=3
MADGSSDAAR EPRPAPAPIR RRSSNYRAYA TEPHAKKKSK ISASRKLQLK TLLLQIAKQE LEREAEERRG EKGRALSTRC QPLELAGLGF AELQDLCRQL HARVDKVDEE RYDIEAKVTK NITEIADLTQ KIFDLRGKFK RPTLRRVRIS ADAMMQALLG ARAKESLDLR AHLKQVKKED TEKENREVGD WRKNIDALSG MEGRKKKFES

Note that the above sequence differs from that of mouse cardiac troponin I:

sp|P48787|TNNI3_MOUSE Troponin I, cardiac muscle OS=Mus musculus OX=10090 GN=Tnni3 PE=1 SV=2
MADESSDAAG EPQPAPAPVR RRSSANYRAY ATEPHAKKKS KISASRKLQL KTLMLQIAKQ EMEREAEERR GEKGRVLRTR CQPLELDGLG FEELQDLCRQ LHARVDKVDE ERYDVEAKVT KNITEIADLT QKIYDLRGKF KRPTLRRVRI SADAMMQALL GTRAKESLDL RAHLKQVKKE DIEKENREVG DWRKNIDALS GMEGRKKKFE G

For this discussion, let us ignore any protein isoforms or post-translational modifications that result in proteins/polypeptides that differ from the amino-acid sequence of a UniProt accession.

In my opinion, it would be wrong to annotate a mouse trait with an OBA class that uses the human-specific P19429|TNNI3_HUMAN Troponin I, cardiac muscle as its entity component. The mouse and human troponin I are different entities, hence the different UniProt accessions. OK, we could add another OBA class with the mouse UniProt ID as the component entity and let people figure out whether the two cardiac troponin I traits are related. How useful would that be for OBA users who want to integrate phenotypic traits from different model organisms? Semantically, any two of the thousands of protein X level traits would look just as similar (each being a direct subclass of protein amount) as the biologically meaningful pair of mouse and human cardiac troponin I level traits.

As of today, there are over 250,000,000 UniProt identifiers. Even if we consider only the reviewed Swiss-Prot subset for genetic model organisms, there are still hundreds of thousands of UniProt IDs that could be used to create new OBA terms of the type 'UniProt ID' in serum and/or 'UniProt ID' in blood. That is a lot of term inflation.

I considered the above problems and decided to use the PRO homology groupings. They group together orthologous UniProt amino-acid sequence entries from the taxa human, mouse and rat. The term requests for the protein X level traits came from the GWAS Catalog (human traits), and the PRO homology grouping classes are defined by the human polypeptide. For example, PR:000016506 troponin I, cardiac muscle is defined as: A protein that is a translation product of the human TNNI3 gene or a 1:1 ortholog thereof. As a result, the term OBA:2045369 troponin I, cardiac muscle level can be used to annotate human, mouse and rat phenotypic traits.

  1. most of the PRO species neutral forms are pseudogeneralizations / trivial homology shadows of the human uniprot

Exactly. That is my point. I expect these terms to be used mostly by human, mouse and rat quantitative traits. Evolutionary homology is more meaningful than none. Uberon is based on evolutionary homology. I could argue that human forelimbs should be assigned different Uberon IDs than the homologous structures of mouse. I could not type this text if I had mouse forelimbs. If I had mouse troponin I in my muscles, I sure would be able to do so to some extent.

  1. what is the SOP for adding proteins for species PRO does not cover?

You can fall back on adding new component terms based on UniProt IDs of polypeptides from species that are outside of the PRO homology groupings (e.g. for fish proteins).
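A minimal sketch of the grouping and the fallback described above, assuming a toy lookup table rather than the real PRO or OBA build pipeline. The function name `entity_for`, the `UniProtKB:` prefix, and the membership of mouse P48787 in PR:000016506 (as the 1:1 ortholog of human TNNI3) are assumptions made for illustration; the fish accession is a placeholder, not a real UniProt ID.

```python
# Illustrative only: a toy lookup, not how OBA or PRO are actually built.

# PRO homology class -> UniProt accessions it is assumed to group.
# Only the human and mouse accessions quoted in this thread are listed.
PRO_GROUPS = {
    "PR:000016506": {"P19429", "P48787"},  # troponin I, cardiac muscle
}


def entity_for(uniprot_acc: str) -> str:
    """Pick the entity component for a 'protein X level' trait.

    Prefer the PRO homology class when one covers the accession;
    otherwise fall back to a species-specific UniProt-based term.
    """
    for pro_id, members in PRO_GROUPS.items():
        if uniprot_acc in members:
            return pro_id
    # Fallback for species outside the PRO homology groupings.
    return f"UniProtKB:{uniprot_acc}"


if __name__ == "__main__":
    print(entity_for("P19429"))            # human cardiac troponin I
    print(entity_for("P48787"))            # mouse cardiac troponin I
    print(entity_for("PLACEHOLDER_FISH"))  # hypothetical fish accession
```

With this shape, human and mouse traits resolve to the same component entity, while a protein from an uncovered species still gets a usable, species-specific component.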

  1. what is the SOP for adding a term from mouse, zfin, whatever, where there is no 1:1? Request a pseudogeneralization, wait til next PRO release, etc. Seems complicated

No, it is not complicated. You can just fall back on adding new component terms based on UniProt IDs of polypeptides from species that are outside of the PRO homology groupings. It is also a less complicated process than deciding whether a bone from a fish belongs to the same Uberon bone class as a human bone, because sequence similarity is computable and provides an objective criterion for homology grouping, if needed.
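To illustrate what "computable" means here, the following rough sketch compares the human and mouse cardiac troponin I sequences quoted earlier in this comment. It uses Python's standard-library `difflib.SequenceMatcher` as a crude stand-in for a proper sequence alignment; real orthology calls rely on dedicated alignment and orthology tools, so the number it prints is only indicative. The function name `rough_similarity` is made up for this example.

```python
from difflib import SequenceMatcher

# Sequences copied from the UniProt entries quoted above (spaces removed).
HUMAN_P19429 = (
    "MADGSSDAAR EPRPAPAPIR RRSSNYRAYA TEPHAKKKSK ISASRKLQLK TLLLQIAKQE "
    "LEREAEERRG EKGRALSTRC QPLELAGLGF AELQDLCRQL HARVDKVDEE RYDIEAKVTK "
    "NITEIADLTQ KIFDLRGKFK RPTLRRVRIS ADAMMQALLG ARAKESLDLR AHLKQVKKED "
    "TEKENREVGD WRKNIDALSG MEGRKKKFES"
).replace(" ", "")

MOUSE_P48787 = (
    "MADESSDAAG EPQPAPAPVR RRSSANYRAY ATEPHAKKKS KISASRKLQL KTLMLQIAKQ "
    "EMEREAEERR GEKGRVLRTR CQPLELDGLG FEELQDLCRQ LHARVDKVDE ERYDVEAKVT "
    "KNITEIADLT QKIYDLRGKF KRPTLRRVRI SADAMMQALL GTRAKESLDL RAHLKQVKKE "
    "DIEKENREVG DWRKNIDALS GMEGRKKKFE G"
).replace(" ", "")


def rough_similarity(a: str, b: str) -> float:
    """Return difflib's similarity ratio (0..1) for two sequences.

    autojunk=False matters: for sequences of 200+ items the default
    heuristic would discard frequently occurring residues as 'junk'.
    """
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    score = rough_similarity(HUMAN_P19429, MOUSE_P48787)
    print(f"human vs mouse cardiac troponin I: {score:.2f}")
```

A score like this can be thresholded or compared across candidate pairs, which is the kind of objective criterion that is not available when judging anatomical homology by eye.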

  1. in general it's better to be specific

Not in this case though. See my example about anatomy terms.

  1. ontology generalization is fine for things like finger1->finger->digit, but we should be more careful with homology, which should not be baked in

Disagree. Homology is baked into many ontologies, e.g. Uberon. See my arguments above. Without evolutionary homology, these OBA classes would be just superficial and biologically meaningless formalisms.

  1. if people really want cross-species generalizations then this can be done at query time swapping in a homology resource of choice (panther, diopt, ...), rather than baking in

Good project for the future. In the meantime, I think I will leave these homology grouping terms in place to help integrate mouse, rat and human quantitative traits.