msinclair2 / so-refactored


Design Patterns for SO/MSO #1

Open msinclair2 opened 6 years ago

msinclair2 commented 6 years ago

@dosumis

In discussion with @keilbeck, we realized that there will be some classes in both the SO and MSO that will have no counterpart in the other ontology. For example, in the MSO, there will be some sequence molecules that either contain no information or for which the information is not yet known. In the SO, there will be some arbitrary artifacts that are not easy to associate with discrete sequence molecules or regions of molecules, such as assemblies and contigs. It won't be possible, then, to generate one ontology from the other in toto, though it should be possible to generate the majority of classes of one ontology from the other. @mikebada has already done a lot of work creating an independent MSO that is BFO-compliant, that integrates with ChEBI, and that annotates what the counterpart for each class is in the SO if known. And he has done so largely using logical axioms, rather than named classes, relying on the reasoner to classify the taxonomy; that implies a design pattern we can extract. I need to discuss this more in detail with him.

Most (all?) of the entities in SO are information entities. In BFO terms, they are generically dependent continuants. SO has 4 top-level classes, all of which would fall under BFO's generically dependent continuant (I think).

One of these, "sequence_attribute", covers "attributes describing a quality of sequence". These are abstract qualities of sequence entities. They inhere in specific qualities of specific molecules described in the MSO. The relation between a generically dependent continuant, like the standard text of a novel, and a specifically dependent continuant, like the color of the ink in which a particular copy of the book is printed, is "is concretized by". I think this relation would hold between sequence_attributes in SO and qualities in MSO. So I'm ready to sketch out a first design pattern:

  1. abstract sequence quality <---> specific sequence quality

     To infer MSO classes from SO classes using this pattern, we require that all subclasses of "sequence_attribute" (domain) must be concretized by a corresponding MSO class that is a subclass of BFO "quality" (range).

The other three top-level classes of SO (sequence_collection, sequence_feature, and sequence_variant) describe the information (e.g. recognition sites, reference genomes) that inheres in actual sequence molecules, which are independent continuants in the MSO. An analogy would be the standard text of a novel and a book, or the print in a book, capable of bearing that text. We can now describe a second design pattern, which will cover the majority of SO and MSO classes:

  2. genomic information <---> specific sequence molecule

     To infer MSO classes from SO classes using this pattern, we require that all subclasses of "sequence_collection", "sequence_feature", and "sequence_variant" (domain) must be generically dependent on a corresponding MSO class that is a subclass of ChEBI "chemical entity" (range).

Some SO classes describe the recognition sites of boundaries between regions. These SO classes would inhere instead in boundary entities, which are "immaterial entities" in BFO. So we have yet another pattern:

  3. genomic boundary information <---> specific sequence boundary

     To infer MSO classes from SO classes using this pattern, we require that all subclasses of "junction", which is a subclass of "sequence_feature" (domain), must be generically dependent on a corresponding MSO class that is a subclass of BFO's "immaterial entity" (range).

From these patterns, the majority of classes that should belong to the MSO will be generated from SO. We further need a way to infer from SO the key annotations (such as rdfs:label and definition) for a minimally sufficient class definition for the MSO classes. These patterns will also tell us which upper-level BFO or ChEBI class each MSO class is a subclass of, to begin reconstructing the taxonomy of MSO.
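To make the three patterns concrete, here is a minimal Python sketch of them as a lookup table. The table layout, the function name `pattern_for`, and the idea of passing a class's lineage as a list are assumptions made for illustration; only the class names, relations, and upper-level targets come from the patterns above.

```python
# Illustrative encoding of the three patterns; the data layout is an assumption.
# Key: SO ancestor class.
# Value: (relation to the MSO counterpart, upper-level parent of that counterpart).
PATTERNS = {
    "sequence_attribute": ("is concretized by", "BFO quality"),
    "junction": ("generically depends on", "BFO immaterial entity"),
    "sequence_collection": ("generically depends on", "ChEBI chemical entity"),
    "sequence_feature": ("generically depends on", "ChEBI chemical entity"),
    "sequence_variant": ("generically depends on", "ChEBI chemical entity"),
}


def pattern_for(lineage):
    """Pick the pattern for an SO class from its lineage.

    `lineage` lists the class itself and its ancestors, most specific first,
    so "junction" takes precedence over its parent "sequence_feature".
    Returns None when no pattern applies (a class with no MSO counterpart).
    """
    for ancestor in lineage:
        if ancestor in PATTERNS:
            return PATTERNS[ancestor]
    return None


print(pattern_for(["junction", "sequence_feature"]))
print(pattern_for(["intron", "sequence_feature"]))
```

Listing "junction" separately is what lets pattern 3 override pattern 2 for boundary classes.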

A fourth design pattern, to complete the full taxonomic reconstruction of MSO, can in consultation with @mikebada be extracted from the work he has already done.

There will need to be two additional templates:

  1. Generating the subontology of classes that exist only in MSO with no relation to any SO class, and

  2. Generating the subontology of classes that exist only in SO with no relation to any MSO class.

An additional design principle we have adopted for convenience is:

The principle of identical 7-digit IDs: Classes in SO and MSO that are connected by a "generically depends on" or "is concretized by" relation should have the exact same 7-digit ID number, just with different prefixes (SO vs MSO) to make programmatic IRI identification of counterparts much easier.
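As a rough illustration of what this principle would allow, the counterpart lookup reduces to a prefix swap. This is only a sketch: the function name `counterpart` is made up, and it assumes IDs are written in the short `SO:1234567` / `MSO:1234567` form rather than as full IRIs.

```python
import re

# Assumed short-form ID: an SO or MSO prefix, a colon, and exactly 7 digits.
ID_PATTERN = re.compile(r"^(SO|MSO):(\d{7})$")


def counterpart(term_id):
    """Return the ID of the counterpart term in the other ontology.

    Works only if the identical-ID principle holds for the term; it cannot
    tell whether a counterpart actually exists.
    """
    match = ID_PATTERN.match(term_id)
    if match is None:
        raise ValueError(f"not an SO/MSO ID: {term_id!r}")
    prefix, digits = match.groups()
    other_prefix = "MSO" if prefix == "SO" else "SO"
    return f"{other_prefix}:{digits}"


print(counterpart("SO:0000704"))
```

Note that the swap carries no information of its own: it is only as reliable as the curation that keeps the two ID spaces aligned.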

For the addition of future terms:

Any time a new term in SO is added that requires generic dependence on some bearer, an MSO counterpart term with the same 7-digit ID but different namespace must be generated at the same time, and its precise place in the MSO taxonomy decided upon. On the other hand, the only time a new MSO term should be created independently of SO is when some molecular feature with no known genomic annotation or informational importance is discovered and needs to be described.

I am new to this discussion, and have not been in the loop of all the earlier conversations in the community about this project and what the community wants and needs for practical purposes. I value all feedback not just on the theoretical aspects of design templates, and what steps to take next, but also the practical aspects of fulfilling what the community needs and not including superfluous features.

dosumis commented 6 years ago

Hi Michael,

Rather a lot to take in here. A few quick comments:

  1. The principle of identical 7-digit IDs: Classes in SO and MSO that are connected by a "generically depends on" or "is concretized by" relation should have the exact same 7-digit ID number, just with different prefixes (SO vs MSO) to make programmatic IRI identification of counterparts much easier.

I would strongly advise not relying on this. The whole point of numeric IDs is that they are completely free of semantics. It's a heavy burden on maintenance to make sure IDs always line up. One ID in the wrong place can mess it up. If you merge or obsolete in one ontology you always have to do so in the other. There's a strong temptation to build software that relies on ID mapping & then you have the burden for ever. And you may decide that there are cases which are not 1:1 SO:MSO. I say all this as someone who's been burned by this in the past.

The mapping also doesn't make sense without equivalentTo axioms connecting SO and MSO. SubClassOf axioms will be true for all subclasses.

  2. As most of the classification hierarchy of MSO and SO will be identical, it is essential that you come up with patterns to infer classification in one from classification in the other (for all relevant classes). Without this, they'll inevitably go out of sync.

  3. The patterns you've outlined look reasonable. I don't have much time to chat about abstract/upper-ontology modelling, but could potentially help a bit to make sure the patterns you want do useful work in automating classification. Do you have any draft equivalent Class axioms? It might help to build a toy ontology which has examples of terms defined with the various patterns you outline above + some relevant imports. You could then use a reasoner (ELK is probably best) to test classification.

msinclair2 commented 6 years ago

Thank you @dosumis for the warning about same IDs. I will incorporate your suggestion.

I would like to ask you to clarify a bit what you mean by "equivalent Class axioms". Do you mean the OWL definition, and the common best practice in design, where you define a "closed world" set with both an existential and universal quantifier, and make that "equivalent to" a named class? Could you give me an example? I am new to ontology design, but I know the basics and am a quick study if pointed in the right direction.

dosumis commented 6 years ago

equivalent Class axioms = equivalentTo axioms ( e.g. 'arm bone' EquivalentTo bone and 'part of' some arm).

I wouldn't advise using the closure pattern. While potentially useful, it doesn't scale. Best to stick to OWL2 EL and use the ELK reasoner. In practise this means existential restrictions only.

msinclair2 commented 6 years ago

I understand @dosumis. These are defined classes, with necessary and sufficient conditions for membership. Any individual for which the axioms hold true is a member of the class, and all members of the class must satisfy the axioms. On the other hand, with a mere inclusion (SubClassOf) axiom, we can only infer that if an individual is a member of the class, it must satisfy the axiom. But just because an individual satisfies the axiom, it does not necessarily mean it is a member of the class (the difference between "if and only if" and a mere "if/then").
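A toy Python sketch of that difference, using the 'arm bone' example from above. It is not a reasoner; the individuals and the way their facts are written down are invented purely for illustration.

```python
# Condition from: 'arm bone' EquivalentTo bone and 'part of' some arm
CONDITION = {"bone", "part of some arm"}

# Hypothetical individuals: what is asserted about them, and what they satisfy.
individuals = {
    "humerus_1": {"asserted": {"arm bone"}, "satisfies": {"bone", "part of some arm"}},
    "radius_1": {"asserted": set(), "satisfies": {"bone", "part of some arm"}},
    "femur_1": {"asserted": set(), "satisfies": {"bone", "part of some leg"}},
}


def members(axiom_kind):
    """Individuals that can be concluded to be an 'arm bone'.

    With "SubClassOf" only asserted members count (if member, then condition).
    With "EquivalentTo" satisfying the condition is also enough (if and only if).
    """
    found = set()
    for name, info in individuals.items():
        if "arm bone" in info["asserted"]:
            found.add(name)
        elif axiom_kind == "EquivalentTo" and CONDITION <= info["satisfies"]:
            found.add(name)
    return found


print(sorted(members("SubClassOf")))
print(sorted(members("EquivalentTo")))
```

Only the EquivalentTo reading picks up `radius_1`, which satisfies the condition without being asserted as a member.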

There are already many equivalent class axioms that @mikebada has written in his draft of the MSO. I'll pull some examples later today.

The taxonomy of MSO is not the same as the SO, at least not as we have it so far. Part of the reason is that Mike has integrated MSO into ChEBI and follows its structure, because the MSO describes biological molecules. The SO is not structured by ChEBI. We will need to use ChEBI as well to generate the MSO.

msinclair2 commented 6 years ago

Sorry for the delay @dosumis, just discussing design issues with @mikebada before responding.

msinclair2 commented 6 years ago

Hi @dosumis, we (myself, @mikebada, @keilbeck) are still discussing which ontology (MSO or SO) to use as the base to infer the other from. Once we hash it out I will get back with succinct and precise design patterns. I appreciate your patience, interest, and help!

mikebada commented 6 years ago

I previously suggested that each SO class could be necessarily and sufficiently defined (i.e., with an OWL equivalentClass axiom) simply as being generically dependent on its corresponding MSO class, e.g., in Manchester OWL syntax:

SO:gene equivalentTo (generically_depends_on some MSO:gene)

One problem I can think of with this approach is that such a definition obviously can’t be created for an SO class that doesn’t have an analog in the MSO, e.g., SO:assembly.

Additionally, I figured that the formal definitions of the current public SO could just be transferred to the corresponding MSO classes, e.g.:

SO:intronic_regulatory_region equivalentTo (SO:transcription_regulatory_region and part_of some SO:intron)

would be removed from the SO but transferred to the corresponding MSO class:

MSO:intronic_regulatory_region equivalentTo (MSO:transcription_regulatory_region and part_of some MSO:intron)

One issue with this approach is that these useful definitions would be removed from the SO. This might be OK for those SO classes that have MSO analogs (which will be the very large majority of classes, I think), but a problem arises for those SO classes that don’t have MSO analogs, as those definitions would then be lost. The same thing would happen to the necessary axioms for such classes as well.

So, what I propose is that we keep the necessary and sufficient and the necessary axioms in the SO (with an important caveat, later) and also recreate them in the MSO, e.g., have both:

SO:intronic_regulatory_region equivalentTo (SO:transcription_regulatory_region and part_of some SO:intron)

MSO:intronic_regulatory_region equivalentTo (MSO:transcription_regulatory_region and part_of some MSO:intron)

In addition to keeping these axioms in the SO (as well as recreating them in the MSO), we could also add necessary axioms linking the SO classes to the MSO classes, e.g.,

SO:intronic_regulatory_region subclassOf (generically_depends_on some MSO:intronic_regulatory_region)

The aforementioned caveat is that it wouldn’t make sense to keep all of the axioms in the SO, as some clearly only apply to MSO classes, e.g.:

MSO:enzymatic_RNA equivalentTo (MSO:transcript and has_quality some MSO:enzymatic)

It wouldn’t make sense to have this definition in the SO as well, as the generically dependent sequence entities obviously don’t have enzymatic functionality. So, we’d have to figure out which kinds of axioms should appear in both the SO and the MSO and which should be transferred exclusively to the MSO. I think all of the “topological” axioms (e.g., part_of, adjacent_to, overlaps)--which I’m guessing constitute most of the axioms--can be represented in both.
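One way this sorting could be sketched in Python: treat an axiom as safe to mirror in the SO only if every relation it uses is topological. The set below contains just the three relations named above, and the regular expression assumes axioms are written in the informal `relation some Class` style used in this thread; both are simplifications, not a full Manchester syntax parser.

```python
import re

# Relations named above as "topological"; a real list would need curating.
TOPOLOGICAL = {"part_of", "adjacent_to", "overlaps"}


def relations_used(axiom):
    """Collect the relation names that appear before 'some' in an axiom string."""
    return set(re.findall(r"(\w+)\s+some\b", axiom))


def mirror_in_so(axiom):
    """True if every relation in the axiom is topological.

    An axiom that uses no relation at all also returns True under this rule.
    """
    return relations_used(axiom) <= TOPOLOGICAL


intronic = ("MSO:intronic_regulatory_region equivalentTo "
            "(MSO:transcription_regulatory_region and part_of some MSO:intron)")
enzymatic = "MSO:enzymatic_RNA equivalentTo (MSO:transcript and has_quality some MSO:enzymatic)"

print(mirror_in_so(intronic))
print(mirror_in_so(enzymatic))
```

The first axiom uses only part_of and so would appear in both ontologies; the second uses has_quality and would stay exclusively in the MSO.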

So, I think that’s my current thinking. Lemme know if you’d like to discuss...

msinclair2 commented 6 years ago

Hi @mikebada, thanks for weighing in and summarizing your thoughts.

There will definitely be classes in both SO and MSO that have no counterpart in the other. We will have to maintain lists of these separately. I see no way around this. They should be limited in number, and should have certain common features that would allow us to make design patterns for them. In the MSO, these would be sequence entities with no known significance, and in the SO, artificial collections of sequence. And, as @dosumis has warned, there may be classes in SO or MSO that don't stand in a 1:1 relationship.

I agree that we need the equivalence axioms in both ontologies. Without them, there would be no way for a reasoner to classify them. If we are going to generate the SO from the MSO and make the taxonomic structure of the SO the same or similar to MSO, we need to copy equivalentTo axioms from an MSO class to its SO counterpart. That seems easy enough. What we need to decide is how and where the taxonomy of SO will differ (not just at the upper level, which we already know about).

What David has advised us to do in order to help us is to have draft equivalentTo axioms for each of our design patterns, as well as a minimal ontology that contains examples of terms as would be generated by each of our design patterns.

I think the patterns I outlined in the OP are still useful, it's just that we will take the MSO, rather than the SO, as our base.

So an example for pattern 2, genomic information <---> specific sequence molecule, which you've provided, would be:

given:

MSO:intronic_regulatory_region equivalentTo (MSO:transcription_regulatory_region and part_of some MSO:intron)

generate:

SO:intronic_regulatory_region equivalentTo (SO:transcription_regulatory_region and part_of some SO:intron)

SO:intronic_regulatory_region subclassOf (generically_depends_on some MSO:intronic_regulatory_region)
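The given/generate step above could be sketched in Python as plain string rewriting. This is a hypothetical helper, not an existing tool: it assumes each axiom is a single line starting with the class name, and that every referenced MSO class has an SO counterpart with the same name.

```python
import re


def generate_so_axioms(mso_axiom):
    """Return the two SO axioms derived from one MSO equivalentTo axiom."""
    mso_class = mso_axiom.split()[0]  # e.g. "MSO:intronic_regulatory_region"
    # Mirror the definition by rewriting every MSO: prefix to SO:.
    so_definition = re.sub(r"\bMSO:", "SO:", mso_axiom)
    so_class = so_definition.split()[0]
    # Link the SO class back to the MSO class it depends on.
    link = f"{so_class} subclassOf (generically_depends_on some {mso_class})"
    return [so_definition, link]


given = ("MSO:intronic_regulatory_region equivalentTo "
         "(MSO:transcription_regulatory_region and part_of some MSO:intron)")

for axiom in generate_so_axioms(given):
    print(axiom)
```

Whether the link should be a subclassOf or part of the equivalentTo is a design choice this sketch does not settle; it simply emits the subclassOf form shown above.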

QUESTION: Should the "generically_depends_on" property here be made part of the equivalentTo axiom (connected with an "and"), or can it stay as a subclassOf axiom?

I'll find examples of the other 2 patterns, and put all examples in a toy ontology. Then we can try writing draft yaml dosdp patterns, and then ask David for his opinion.

On Mon, Oct 16, 2017 at 1:37 AM, mikebada notifications@github.com wrote:


msinclair2 commented 6 years ago

David (@dosumis),

I'm at a bit of a loss on how to proceed without ID mapping. SO has been in continuous use for years and we can't go changing IDs that users are already familiar with. But if we want to generate SO from MSO, how do we make sure the right IDs are generated? Should we use ID mapping once to refactor the existing SO and then not use it for new terms? But then, don't we want to be able to dynamically generate the entire SO ontology from the MSO at any time, to ease the burden on curators and prevent them going out of sync due to human error?

There are a lot of places where I'm confused how to proceed, and that I'd like to ask you about, but I'm trying to keep my questions in manageable little chunks so I do not overwhelm you and turn you off....

msinclair2 commented 6 years ago

It turns out all I needed to do was make the "generically_depends_on some MSO_class" axiom for an SO class an equivalentTo, rather than subclassOf. So long as the MSO is imported in the same space, a reasoner can automatically infer the correct hierarchy for SO based on the MSO classes they depend on.

I verified this with a toy ontology in which I manually created 4 SO classes, adding to each only an equivalentTo "generically_depends_on some" axiom pointing to its MSO counterpart. With the MSO directly imported, I ran the reasoner, and the 4 SO classes were classified in exact correspondence with the MSO taxonomy.
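For anyone who wants the intuition without opening an ontology editor, here is a small Python sketch of the inference involved. It is not a reasoner and not the toy ontology itself: the four classes and the MSO hierarchy below (including "MSO:region") are made up for illustration. It relies on one fact about existential restrictions: if MSO:A is a subclass of MSO:B, then "generically_depends_on some MSO:A" is a subclass of "generically_depends_on some MSO:B".

```python
def ancestors(cls, parents):
    """All strict superclasses of `cls`, following asserted parent links."""
    seen = set()
    stack = list(parents.get(cls, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def infer_so_superclasses(mso_parents, counterpart_of):
    """Infer SO superclasses from the MSO hierarchy.

    `counterpart_of` maps each SO class to the MSO class in its
    equivalentTo (generically_depends_on some ...) definition.
    """
    so_for = {mso: so for so, mso in counterpart_of.items()}
    return {
        so: {so_for[a] for a in ancestors(mso, mso_parents) if a in so_for}
        for so, mso in counterpart_of.items()
    }


# Hypothetical MSO fragment (child -> asserted parents).
mso_parents = {
    "MSO:intron": ["MSO:region"],
    "MSO:transcription_regulatory_region": ["MSO:region"],
    "MSO:intronic_regulatory_region": ["MSO:transcription_regulatory_region"],
}
counterpart_of = {
    "SO:region": "MSO:region",
    "SO:intron": "MSO:intron",
    "SO:transcription_regulatory_region": "MSO:transcription_regulatory_region",
    "SO:intronic_regulatory_region": "MSO:intronic_regulatory_region",
}

inferred = infer_so_superclasses(mso_parents, counterpart_of)
for so_class in sorted(inferred):
    print(so_class, "->", sorted(inferred[so_class]))
```

The SO classes carry no asserted parents of their own here; their placement comes entirely from the MSO classes they are defined against.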

Big thanks to Mike Bada for pointing this out to me!