obophenotype / cell-ontology

An ontology of cell types
https://obophenotype.github.io/cell-ontology/
Creative Commons Attribution 4.0 International

Fix overspecified and problematic OWL definitions for non-leaf classes in immune cell branch #694

Open cmungall opened 4 years ago

cmungall commented 4 years ago

[@dosumis this ticket makes points that you have been making for a decade, apologies if this duplicates an existing ticket, I didn't see one and thought to start one fresh]

In general CL follows sensible Rector Normalization but the immune cell hierarchy includes two kinds of problems:

  1. overspecified OWL definitions: https://douroucouli.wordpress.com/2019/07/29/ontotip-dont-over-specify-owl-definitions
  2. text and OWL definitions are inconsistent: https://douroucouli.wordpress.com/2019/07/08/ontotip-write-simple-concise-clear-operational-textual-definitions/, see S11 in the Table and the Seppälä paper

This is not just fussing. This has serious impacts on maintainability and the ability to use reasoning to automate ontology classification. There are also potentially many cryptic errors lurking with these definitions. The impacts for inference are worse for non-leaf classes so I suggest prioritizing these.

An example is 'natural killer cell':

'group 1 innate lymphoid cell'
 and ('capable of' some 'natural killer cell mediated immunity')
 and ('capable of' some 'regulation of immune response')
 and (lacks_plasma_membrane_part some 'CD19 molecule')
 and (lacks_plasma_membrane_part some 'CD3 epsilon')
 and (lacks_plasma_membrane_part some 'membrane-spanning 4-domains subfamily A member 1')
 and (lacks_plasma_membrane_part some 'CD14 molecule')

This is likely overspecified.

It also bears little resemblance to the textual definition: A lymphocyte that can spontaneously kill a variety of target cells without prior antigenic activation via germline encoded activation receptors and also regulate immune responses via cytokine release and direct contact with other cells.

The solution in this case may involve refactoring the logical axioms to a tighter equivalence axiom plus GCIs, potentially alternate equivalence axioms, or perhaps no equivalence axioms and only subClassOf (Necessary conditions only).

A more insidious example is 'group 1 innate lymphoid cell', which has the logical def:

'innate lymphoid cell'
 and ('capable of' some 'interferon-gamma production')

and the text definition: An innate lymphoid cell that is capable of producing the type 1 cytokine IFN-gamma, but not Th2 or Th17 cell-associated cytokines.

The exclusion criteria in the text def suggest that the logical definition is under-specified and the ontology is over-axiomatized. This could cause misclassification.

Remember, classes do not have to have logical defs (N+S conditions). It is perfectly acceptable to have only N conditions if the true definition is too hard to specify in OWL.

Another example is 'group 3 innate lymphoid cell', equivalent to

'innate lymphoid cell'
 and ('has part' some 'nuclear receptor ROR-gamma isoform 2')

The textual definition is:

An innate lymphoid cell that constitutively expresses RORgt and is capable of expressing IL17A and/or IL-22.

It's not clear that constitutively expressing a protein is the same as having at least one instance as a part, but the text def includes two additional criteria not reflected in the logical definition or any logical axiom. I think in this case from my reading of https://www.nature.com/articles/nri3365 this is reasonably safe but nevertheless IL17A and IL-22 production should be stated as GCIs.

There are many such cases.

The strategy here should be to examine all immune cell classes and their logical axioms side by side, and implement the following process:

  1. if the text definition is narrower than the logical definition, then either:
    • convert the logical def (N+S) to subclassOf [quick and always correct]
    • if time and priorities permit, add clauses to the logical def until it matches [slower]. This is more important the higher up the hierarchy you go.
  2. if the text definition is broader than the logical definition
    • consider splitting into logical def + GCIs [prioritize for non-leaves]
  3. if the text definition otherwise grossly mismatches
    • tag the class using a standard property indicating it is problematic
    • if time and priorities permit, fix in place.
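
The process above can be sketched as a small triage function. This is only an illustration: the `Mismatch` enum, the `triage` function, the action strings and the "high"/"low" priority labels are hypothetical names, not part of any CL tooling, and deciding which kind of mismatch applies still needs a curator reading both definitions.

```python
from enum import Enum


class Mismatch(Enum):
    TEXT_NARROWER = 1  # text def is narrower than the logical def
    TEXT_BROADER = 2   # text def is broader than the logical def
    GROSS = 3          # text def otherwise grossly mismatches


# Hypothetical action labels, paraphrasing the list above.
TO_SUBCLASS = "convert logical def (N+S) to SubClassOf"
TIGHTEN = "add clauses to the logical def until it matches the text def"
SPLIT = "consider splitting into logical def + GCIs"
TAG = "tag the class as problematic"
FIX = "fix in place"


def triage(mismatch: Mismatch, is_leaf: bool, time_permits: bool) -> tuple[str, str]:
    """Return (action, priority) for one class; non-leaf classes get high priority."""
    priority = "low" if is_leaf else "high"
    if mismatch is Mismatch.TEXT_NARROWER:
        action = TIGHTEN if time_permits else TO_SUBCLASS
    elif mismatch is Mismatch.TEXT_BROADER:
        action = SPLIT
    else:
        action = FIX if time_permits else TAG
    return action, priority
```

For example, `triage(Mismatch.TEXT_NARROWER, is_leaf=False, time_permits=False)` picks the quick conversion to SubClassOf and flags the class as high priority because it is a non-leaf.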

dosumis commented 4 years ago

Remember, classes do not have to have logical defs /N+S conditions. It is perfectly acceptable to have only N conditions if the true definition is too hard to specify in OWL.

Amen

It's worth noting that most of this branch is still manually classified, so there isn't (currently) an expectation that most logical definitions will support auto-classification of the branch. This still leaves the problem of very large lists of inherited marker assertions. I've always worried that these are very hard to check.

  1. if the text definition is narrower than the logical definition, then either:
    • convert the logical def (N+S) to subclassOf [quick and always correct]
    • if time and priorities permit, add clauses to the logical def until it matches [slower]. This is more important the higher up the hierarchy you go.

I like this. It will be interesting to look at logical diffs while this is happening. Given the specific genus typically used, I expect loss of useful inference to be rare, but still worth tracking.

  2. if the text definition is broader than the logical definition
    • consider splitting into logical def + GCIs [prioritize for non-leaves]

Can you provide more detail? Is the idea to use GCIs as a way to record species specific markers?

balhoff commented 4 years ago

Expressions like lacks_plasma_membrane_part some 'CD19 molecule' will produce a classification opposite to the intent. If these are meant to be expanded according to the annotation on lacks_plasma_membrane_part, they should be in annotation properties, or pun the filler classes as individuals, or be stored in a template source rather than OWL.

cmungall commented 4 years ago

if the text definition is broader than the logical definition, consider splitting into logical def + GCIs [prioritize for non-leaves]

Can you provide more detail? Is the idea to use GCIs as a way to record species specific markers?

Not necessarily for this case. But apologies, I should have said hidden GCIs.

Example:

textdef: "An X cell is a Foo cell that has function F. X cells express M1 and M2"

Original OWL def:

X = Foo and has-function F and has-part M1 and has-part M2

Proposed OWL axioms:

Class: X
 Equiv: Foo and has-function some F
 SubClassOf: has-part some M1, has-part some M2
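
The effect of the split can be illustrated with a toy classifier. This is a sketch only: it matches sets of condition strings rather than doing OWL reasoning, and `ClassDef` and `classify` are hypothetical names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassDef:
    name: str
    equivalent: frozenset                 # necessary and sufficient conditions
    necessary: frozenset = frozenset()    # SubClassOf-only conditions


def classify(known: set, cls: ClassDef):
    """Return the facts known after classification, or None if `known` does not qualify."""
    if cls.equivalent <= known:
        # Membership is established, so the necessary conditions follow.
        return known | cls.necessary
    return None


# Original: everything sits in the equivalence axiom.
original = ClassDef("X", frozenset({"Foo", "has-function F", "has-part M1", "has-part M2"}))

# Proposed: a tight equivalence axiom plus marker assertions as SubClassOf.
proposed = ClassDef(
    "X",
    equivalent=frozenset({"Foo", "has-function F"}),
    necessary=frozenset({"has-part M1", "has-part M2"}),
)

# A cell for which only the genus and the function have been recorded.
observed = {"Foo", "has-function F"}
```

Here `classify(observed, original)` returns `None` because the markers were never recorded, whereas `classify(observed, proposed)` succeeds and adds both `has-part` facts as inferences.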

Hidden GCIs are powerful, so use them advisedly.

Speculation: It may be that this pattern is used less frequently in the future if marker type definitions are typically moved down to the species level. But there are definitely cases now where this stratification would improve things.

dosumis commented 4 years ago

Agree on the lacks relations. Things are not quite as bad as they look though since Alex and James worked on converting to real negation and using this to find inconsistencies.

cmungall commented 4 years ago

What was the strategy for finding inconsistencies? It's not uncommon to have cryptic unwanted inferences without raising an inconsistency, especially if the ontology is under-axiomatized (e.g. with disjoints), as is the case for CL: https://douroucouli.wordpress.com/2018/08/03/debugging-ontologies-using-owl-reasoning-part-1-basics-and-disjoint-classes-axioms/

If all negated statements are restricted to the same level in PRO AND we are working at the species generic level then this should minimize issues. But the moment we bring in isoforms or species specific subclasses we get cryptic incorrect inferences up the wazoo.

dosumis commented 4 years ago

what was the strategy for finding inconsistencies?

I believe they managed to modularise, add explicit negation and use HermiT. @jamesaoverton & @addiehl should be able to fill you in on the details. Would need a robust approach to modularization in future to ensure scaling though.

But the moment we bring in isoforms or species specific subclasses we get cryptic incorrect inferences up the wazoo.

Yep. We once had inference of acellular from the axioms "lacks some 'lobed nucleus'". Looking at the classification of PRO & GO terms used for molecular definitions used in lacks statements, there's already lots of potential for problems like this, e.g. 'natural helper lymphocyte' can be found by the DL query lacks_plasma_membrane_part some 'T cell receptor complex' because it has the axiom clause lacks_plasma_membrane_part some 'alpha-beta T cell receptor complex' and 'alpha-beta T cell receptor complex' subClassOf 'T cell receptor complex'. Nothing in the definition precludes this cell type from expressing a different isoform.

Given this, perhaps moving towards Jim's suggestion of punning the filler classes as individuals would be the safest & least disruptive approach (see https://arxiv.org/abs/1410.3862)
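
How the three encodings interact with the filler hierarchy can be illustrated with a toy subsumption check. This is a sketch and not a reasoner: the function names are hypothetical, the hierarchy contains only the two complex classes from the example above, and the value-restriction case assumes that distinct punned individuals are never asserted to be the same.

```python
# Filler hierarchy taken from the example above (child -> parent).
SUBCLASS_OF = {
    "alpha-beta T cell receptor complex": "T cell receptor complex",
}


def ancestors(cls: str) -> list[str]:
    """Return cls followed by all of its superclasses."""
    chain = [cls]
    while chain[-1] in SUBCLASS_OF:
        chain.append(SUBCLASS_OF[chain[-1]])
    return chain


def lacks_some_entails(asserted: str, query: str) -> bool:
    # (lacks some A) implies (lacks some B) whenever A SubClassOf B.
    return query in ancestors(asserted)


def not_has_some_entails(asserted: str, query: str) -> bool:
    # not (has some A) implies not (has some B) whenever B SubClassOf A.
    return asserted in ancestors(query)


def lacks_value_entails(asserted: str, query: str) -> bool:
    # With punning the filler is an individual, so the class hierarchy plays no role.
    return asserted == query


AB_TCR = "alpha-beta T cell receptor complex"
TCR = "T cell receptor complex"
```

With these definitions, `lacks_some_entails(AB_TCR, TCR)` is true, which is the unwanted inference described above, while `not_has_some_entails(AB_TCR, TCR)` and `lacks_value_entails(AB_TCR, TCR)` are both false.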

jamesaoverton commented 4 years ago

I found my notes from 2019-05-30. Our goal was to run a quick experiment looking for contradictions between the various membrane part relations.

Starting with the full Cell Ontology:

  1. all object properties removed except for:
    • has plasma membrane part
    • lacks_plasma_membrane_part
    • has_low_plasma_membrane_amount
    • has_high_plasma_membrane_amount
  2. has_low_plasma_membrane_amount and has_high_plasma_membrane_amount mapped to just 'has plasma membrane part'
  3. all "lacks_plasma_membrane_part some X" axioms replaced by "'has plasma membrane part' only (not X)"
  4. run HermiT
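
Steps 2 and 3 could be sketched as a text-level rewrite of Manchester-syntax expressions. This is an assumption about how one might do it, not the script used in the experiment (the notes do not say how the axioms were rewritten). The `rewrite` function and both patterns are hypothetical, and quoted labels containing apostrophes are not handled.

```python
import re

HAS_PMP = "'has plasma membrane part'"

# A filler is either a quoted label ('CD19 molecule') or a bare identifier.
LACKS = re.compile(r"lacks_plasma_membrane_part\s+some\s+('[^']+'|[^\s()]+)")
AMOUNT = re.compile(r"has_(?:low|high)_plasma_membrane_amount")


def rewrite(expr: str) -> str:
    """Apply steps 2 and 3 to one Manchester-syntax class expression."""
    expr = AMOUNT.sub(HAS_PMP, expr)                      # step 2: collapse low/high
    return LACKS.sub(HAS_PMP + r" only (not \1)", expr)   # step 3: real negation
```

For example, `rewrite("(lacks_plasma_membrane_part some 'CD19 molecule')")` yields `('has plasma membrane part' only (not 'CD19 molecule'))`. A real pipeline would more likely transform the axioms through an OWL library than through regular expressions.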

We found a list of 14 possible errors (some of them duplicates) that Alex reviewed. A few of those were genuine problems and I believe that Alex submitted fixes.

It would be better not to remove all the other object properties (step 1), but we wanted to use HermiT to handle the negation, and HermiT was failing for me on the full Cell Ontology.

dosumis commented 4 years ago

all "lacks_plasma_membrane_part some X" axioms replaced by "'has plasma membrane part' only (not X)"

Shouldn't that be:

not (has_plasma_membrane_part some X) ?

balhoff commented 4 years ago

all "lacks_plasma_membrane_part some X" axioms replaced by "'has plasma membrane part' only (not X)"

Shouldn't that be:

not (has_plasma_membrane_part some X) ?

I believe these are logically the same, but some incomplete reasoners will have better support for one vs. the other. For example ELK can find some contradictions for the second form.
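
As a sanity check on the equivalence, here is a brute-force sketch that enumerates every relation and every class extension over a three-element domain and compares the two forms for each individual. It only illustrates the standard semantics on a tiny domain and is not a proof for arbitrary ontologies. The names `powerset` and `forms_agree` are hypothetical.

```python
from itertools import chain, combinations, product


def powerset(items):
    """Yield every subset of items as a tuple."""
    items = list(items)
    return chain.from_iterable(combinations(items, n) for n in range(len(items) + 1))


DOMAIN = range(3)
PAIRS = list(product(DOMAIN, repeat=2))


def forms_agree() -> bool:
    """Check 'R only (not X)' against 'not (R some X)' in every small interpretation."""
    for rel in map(set, powerset(PAIRS)):          # interpretation of the property R
        for x in map(set, powerset(DOMAIN)):       # interpretation of the class X
            for a in DOMAIN:
                successors = {b for (s, b) in rel if s == a}
                only_not_x = all(b not in x for b in successors)
                not_some_x = not any(b in x for b in successors)
                if only_not_x != not_some_x:
                    return False
    return True
```

`forms_agree()` returns `True`, consistent with the two expressions being logically the same, so the choice between them is about reasoner support.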

I think it is important that the ontology not publish (or ever contain) the "lacks" existential restrictions. The case for real negation here is much clearer than for the "absence" phenotypes that we struggle with. They are obviously wrong (i.e. we don't need to run a reasoner to find that out), and the CL release pipeline is not the only tool running an OWL reasoner on CL—these will feed into any reasoners used by downstream users.

dosumis commented 4 years ago

Agreed, but there's always a big demand for these from immunologists, so we need a way to retain them in some less dangerous form - especially where axioms are used by other resources as a (non-logical?) reference. Do you think punning plus value restriction would be sufficient? I think this is probably the least disruptive approach. @addiehl can provide background on how these are being used outside of an OWL context.

balhoff commented 4 years ago

I like the punning—I think it makes sense because you are kind of directly talking about the class. Not sure how this comes out in OBO though, if that is an issue.

dosumis commented 4 years ago

We're already outside of OBO with nested class expressions, so I don't think that matters.

addiehl commented 4 years ago

Immunologists use the expression or lack of expression of certain key markers to identify or exclude major cell lineages:

  • CD3 epsilon = marker of the T cell lineage
  • CD19 = marker of the B cell lineage
  • CD14 = marker of monocytes/macrophages/neutrophils/dendritic cells (myeloid cells)

Sometimes immunologists identify cell types by what a cell both expresses and does not express.

We have included these lacks statements about lineage markers on immune cell types in the CL to facilitate analyses of flow cytometry and CyTOF results.

dosumis commented 4 years ago

With punning we'd turn these from "lacks some CD19" to "lacks value CD19" as a way of retaining the axioms but avoiding dangerous inference over the PRO hierarchy.

addiehl commented 4 years ago

That sounds fine to me, though I don't fully understand the technical implementation of punning.