obophenotype / cell-ontology

An ontology of cell types
https://obophenotype.github.io/cell-ontology/
Creative Commons Attribution 4.0 International

Proposed rule: species-specific marker-defined cells should use species-specific PRO IDs of the same species #695

Open cmungall opened 3 years ago

cmungall commented 3 years ago

This rule could be implemented in SPARQL: if a class has an in-taxon restriction, then all PRO IDs it references should be at the species level (i.e. UniProt-derived) or below, and each should reference a PRO class that has a matching in-taxon.

In #689 it seems there is a mix. This will cause problems for most use cases for having markers in the ontology: classification will be incomplete, and the ontology will be less useful for users who want to do things such as analyses based on data in UniProt. Note that we can't infer these; they have to be added manually or heuristically. I also think the rule has to be implemented as SPARQL and not as an OWL axiom.
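The check itself can be sketched outside SPARQL to make the intended logic concrete. This is only a minimal Python illustration, not how the rule is implemented in the repo: the `CellClass` record, the `PROTEIN_TAXON` lookup table and the function names are hypothetical, and it assumes that a species-level PRO ID is one whose local part is a UniProt accession (like PR:P27814), while numeric IDs (like PR:000002977) are species-neutral.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# UniProt accession pattern, with an optional isoform suffix such as "-4".
UNIPROT_ACC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:-\d+)?$"
)


@dataclass
class CellClass:
    """Hypothetical flattened view of one CL logical definition."""
    curie: str
    in_taxon: Optional[str]  # e.g. "NCBITaxon:10090"; None if species-neutral
    protein_ids: List[str] = field(default_factory=list)


# Hypothetical lookup of the taxon of each species-level PRO class.
# In practice this would have to be read from PRO itself.
PROTEIN_TAXON = {"PR:P27814": "NCBITaxon:10090"}


def is_species_level(pro_curie: str) -> bool:
    """Assume species-level PRO IDs are the UniProt-derived ones."""
    local_id = pro_curie.split(":", 1)[1]
    return bool(UNIPROT_ACC.match(local_id))


def check_cell(cell: CellClass) -> List[str]:
    """Return one message per protein reference that breaks the proposed rule."""
    if cell.in_taxon is None:
        return []  # the rule only applies to classes with an in-taxon
    problems = []
    for pid in cell.protein_ids:
        if not is_species_level(pid):
            problems.append(f"{cell.curie}: {pid} is not species-level")
        elif PROTEIN_TAXON.get(pid) != cell.in_taxon:
            problems.append(f"{cell.curie}: {pid} does not match {cell.in_taxon}")
    return problems
```

With this sketch, a mouse-restricted class that still points at PR:000002977 would be reported, while the same class pointing at PR:P27814 would pass.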

As an example, in the current release 'NK1.1-positive natural killer cell' http://purl.obolibrary.org/obo/CL_0002438 does not have any explicit taxon designation (although a comment says "Nk1.1 expression is restricted to C57BL strains of laboratory mice.").

textdef: A mature NK cell that is NK1.1-positive.

logdef: mature natural killer cell and (has plasma membrane part some killer cell lectin-like receptor subfamily B member 1C)

The PRO class is species-neutral, although in this case there is only one child http://www.ontobee.org/ontology/PR?iri=http://purl.obolibrary.org/obo/PR_000002977

In #689 this is relabeled. The text def does not change.

The logdef becomes: 'mature natural killer cell' and ('has plasma membrane part' some 'killer cell lectin-like receptor subfamily B member 1C') and ('in taxon' some 'Mus musculus')

However, the PRO class remains at the gene level. It should be pushed down to the UniProt level: http://purl.obolibrary.org/obo/PR_P27814

cmungall commented 3 years ago

So here is an interesting example

In the current release, 'decidual natural killer cell' http://purl.obolibrary.org/obo/CL_0002343 looks to be applicable across mammals: natural killer cell subset that is found in the decidual of the uterus and is CD56-high, Galectin-1-positive and CD16-negative.

mature natural killer cell and (has plasma membrane part some galectin-1) and (has_high_plasma_membrane_amount some neural cell adhesion molecule 1) and (lacks_plasma_membrane_part some low affinity immunoglobulin gamma Fc region receptor III)

[note the deviation between textual definition and OWL def - already a bad smell, see #694]

In #689 this is relabeled as human-specific

http://www.ontobee.org/ontology/PR?iri=http://purl.obolibrary.org/obo/PR_000001483

The text definition remains unchanged (cc @matentzn this is a problem) and the logical def has an in-taxon added:

'mature natural killer cell'
 and ('has plasma membrane part' some galectin-1)
 and ('in taxon' some 'Homo sapiens')
 and (has_high_plasma_membrane_amount some 'neural cell adhesion molecule 1')
 and (lacks_plasma_membrane_part some 'low affinity immunoglobulin gamma Fc region receptor III')

Note that this is immunoglobulin gamma Fc region receptor III:

http://www.ontobee.org/ontology/PR?iri=http://purl.obolibrary.org/obo/PR_000001483

An immunoglobulin gamma Fc receptor II/III/IV that is a translation product of the mouse Fcgr3 gene or a 1:1 ortholog thereof

The fact that the mouse gene is primary here in the definition is a tell. PRO only has one subclass, a mouse one.

Not everyone agrees on the 1:1 ortholog:

https://www.alliancegenome.org/gene/MGI:95500#orthology

It seems the ortholog may be FCGR2A https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:3616 aka CD32

However, this has a totally different PR grouping: http://purl.obolibrary.org/obo/PR_000001480

This may involve a request to PRO...

...however if we look at the cited paper https://pubmed.ncbi.nlm.nih.gov/19800965/

it seems the original OWL def was not very good anyway: the paper says CD56(+++)Galectin (Gal)-1(+)CD16(-), as does the text def.

lesson: match text defs and owl defs!

dosumis commented 3 years ago

Great example.

I'd suggest a very simple textual and logical definition for the species neutral term, with this term retaining the current ID:

'natural killer cell' that part_of* some decidua,

(* I think we need to enforce a convention for resident vs transitory with a different relation for transitory)

Adding species-specific subclasses allows us to move markers down to the species level. With this approach, adding species-specific terms can (hopefully) finally give us the truly species-neutral terms some of us have wanted for a long time (e.g. see the long-standing issues about integrating zebrafish immune cells; @cerivs @ybradford).

There are presumably some cases where we need species-neutral PRO terms (e.g. T cell requiring TCR), but I'd prefer to keep this to the most non-controversial cases. I'm keen that PRO not be a bottleneck. I think this can be achieved at the species level by generating PRO IDs from UniProt IDs using the standard pattern.
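For illustration, generating a PRO ID from a UniProt ID can be sketched as below. This is a sketch under assumptions rather than an official PRO tool: the function names are made up, and the pattern (PR: or PR_ followed by the unchanged UniProt accession) is inferred from IDs such as http://purl.obolibrary.org/obo/PR_P27814 cited earlier in this thread.

```python
import re

OBO_BASE = "http://purl.obolibrary.org/obo/"

# UniProt accession pattern, with an optional isoform suffix such as "-4".
UNIPROT_ACC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})(?:-\d+)?$"
)


def uniprot_to_pro_curie(accession: str) -> str:
    """Build a PRO CURIE from a UniProt accession (assumed pattern)."""
    if not UNIPROT_ACC.match(accession):
        raise ValueError(f"not a UniProt accession: {accession!r}")
    return f"PR:{accession}"


def uniprot_to_pro_iri(accession: str) -> str:
    """Build the OBO PURL form of the same ID (assumed pattern)."""
    return OBO_BASE + uniprot_to_pro_curie(accession).replace(":", "_", 1)
```

Under these assumptions, `uniprot_to_pro_iri("P27814")` yields the mouse IRI mentioned above. Whether every generated ID already exists as a class in PRO is a separate question that this sketch does not check.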

cmungall commented 3 years ago

@dosumis agreed on all points

> Adding species specific subclasses allows us to move markers down to the species level. With this approach, adding species-specific terms can (hopefully) finally give us the truly species neutral terms some of us have wanted for a long time

This would be great

> There are presumably some cases where we need species-neutral PRO terms (e.g. T-cell requiring TCR), but I'd prefer to keep this to the most non-controversial cases. I'm keen that PRO not be a bottleneck. I think this can be achieved at the species-level by generating PRO IDs from Uniprot IDs using the standard pattern.

Aside: I wish PRO didn't use their own prefix here and we could just use a UniProt URI that is guaranteed to resolve.

How would we define (textually or logically) the species-neutral form? I can imagine doing this as a kind of open union of species-specific forms. In some cases it may be possible to use a functional or other traditional differentia. Of course, not everything needs a logical def, but we should at least have a text def (and of course they should align when both are present).

Would we allow two logical defs for species-specific terms? This would be non-standard, but I think it could be justified in many cases here. E.g. one logical definition that uses in-taxon as differentia and the species-neutral form as genus, and another that is based on species-specific proteins (does this count as a bidirectional hidden GCI?)

cmungall commented 3 years ago

These are also problematic:

memory CCR4-positive regulatory T cell EquivalentTo
 'memory regulatory T cell'
 and ('has plasma membrane part' some 'receptor-type tyrosine-protein phosphatase C isoform CD45RO')
 and ('has plasma membrane part' some 'C-C chemokine receptor type 4')
 and (only_in_taxon some 'Homo sapiens')

effector memory CD8-positive, alpha-beta T cell, terminally differentiated EquivalentTo
 'CD8-positive, alpha-beta memory T cell'
 and ('has plasma membrane part' some 'receptor-type tyrosine-protein phosphatase C isoform CD45RA')
 and (only_in_taxon some 'Homo sapiens')
 and (has_completed some 'memory T cell differentiation')
 and (lacks_plasma_membrane_part some 'receptor-type tyrosine-protein phosphatase C isoform CD45RO')
 and (lacks_plasma_membrane_part some 'C-C chemokine receptor type 7')

CD8-positive, alpha-beta memory T cell, CD45RO-positive EquivalentTo
 'CD8-positive, alpha-beta memory T cell'
 and ('has plasma membrane part' some 'receptor-type tyrosine-protein phosphatase C isoform CD45RO')
 and ('has plasma membrane part' some 'interleukin-7 receptor subunit alpha')
 and (only_in_taxon some 'Homo sapiens')
 and (has_completed some 'memory T cell differentiation')
 and (has_high_plasma_membrane_amount some 'CD44 molecule')
 and (has_high_plasma_membrane_amount some 'interleukin-2 receptor subunit beta')
 and (lacks_plasma_membrane_part some 'interleukin-2 receptor subunit alpha')

CD4-positive, alpha-beta memory T cell, CD45RO-positive EquivalentTo
 'CD4-positive, alpha-beta memory T cell'
 and ('has plasma membrane part' some 'receptor-type tyrosine-protein phosphatase C isoform CD45RO')
 and ('has plasma membrane part' some 'interleukin-7 receptor subunit alpha')
 and (only_in_taxon some 'Homo sapiens')
 and (has_completed some 'memory T cell differentiation')
 and (has_high_plasma_membrane_amount some 'CD44 molecule')
 and (has_high_plasma_membrane_amount some 'interleukin-2 receptor subunit beta')
 and (lacks_plasma_membrane_part some 'interleukin-2 receptor subunit alpha')

These are clearly overspecified. If the goal is to say that each of these is restricted to human, then that should be a separate axiom. But this is unnecessary if species-specific PRO IDs are used.
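To make the "separate axiom" point concrete, here is a rough Python sketch that pulls the taxon conjunct out of an equivalence expression written in the informal style used in this thread. It is purely illustrative: the function names are invented, it is plain string handling rather than an OWL parser, and it assumes labels are single-quoted and contain no apostrophes.

```python
from typing import List, Tuple


def split_top_level(expr: str, sep: str = " and ") -> List[str]:
    """Split on `sep` only outside parentheses and single-quoted labels."""
    parts: List[str] = []
    depth, in_quote, start, i = 0, False, 0, 0
    while i < len(expr):
        ch = expr[i]
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif depth == 0 and expr.startswith(sep, i):
                parts.append(expr[start:i].strip())
                i += len(sep)
                start = i
                continue
        i += 1
    parts.append(expr[start:].strip())
    return parts


def separate_taxon(expr: str) -> Tuple[str, List[str]]:
    """Return (equivalence without taxon conjuncts, taxon restrictions).

    The taxon restrictions are what would be asserted as separate
    SubClassOf axioms instead of sitting inside the equivalence.
    """
    taxon_prefixes = ("(only_in_taxon ", "('in taxon' ")
    conjuncts = split_top_level(expr)
    taxon = [c for c in conjuncts if c.startswith(taxon_prefixes)]
    rest = [c for c in conjuncts if not c.startswith(taxon_prefixes)]
    return " and ".join(rest), [c[1:-1] for c in taxon]
```

Applied to the first definition above, this would leave the genus and the two marker conjuncts in the equivalence and return `only_in_taxon some 'Homo sapiens'` as the candidate for a standalone SubClassOf axiom.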

I don't understand the rationale for a lot of these. E.g. "memory CCR4-positive regulatory T cell" sounds like it may be applicable to mice (they have memory regulatory T cells, and they have CCR4). The paper linked as ref describes human nomenclature, but nothing in that paper prohibits such a term in mouse: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3409649/ cc @addiehl

addiehl commented 3 years ago

CD45RO and CD45RA are human-specific isoforms of CD45 (these may theoretically be present in non-human primates, but are not present in mouse). These were originally added to PRO without reference to UniProt sequences, since at the time there were no UniProt entries available. UniProt has since added the sequence entries, which led to some temporary confusion in PRO (see https://github.com/PROconsortium/PRoteinOntology/issues/167). These could therefore be switched to the explicit human PRO entries PR:P08575-4 (CD45RO) and PR:P08575-8 (CD45RA).

Different markers are used in mice to identify memory T cells.
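As a rough illustration of the switch suggested above, the two isoform labels could be swapped for the human PRO entries with a simple lookup. This is only a sketch: the `HUMAN_ISOFORMS` table and the function name are hypothetical, the expression is treated as plain text in the informal style used in this thread, and a real change would be made in the ontology editing workflow rather than by string replacement.

```python
# Labels as used in the logical definitions quoted in this thread,
# mapped to the explicit human PRO entries named above.
HUMAN_ISOFORMS = {
    "receptor-type tyrosine-protein phosphatase C isoform CD45RO": "PR:P08575-4",
    "receptor-type tyrosine-protein phosphatase C isoform CD45RA": "PR:P08575-8",
}


def use_human_isoforms(expr: str) -> str:
    """Replace quoted species-neutral isoform labels with human PRO CURIEs."""
    for label, curie in HUMAN_ISOFORMS.items():
        expr = expr.replace(f"'{label}'", curie)
    return expr
```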