obophenotype / uberon

An ontology of gross anatomy covering metazoa. Works in concert with https://github.com/obophenotype/cell-ontology
http://obophenotype.github.io/uberon/

Generate matrix of start/end stages #423

Closed fbastian closed 3 years ago

fbastian commented 10 years ago

Could you please generate a matrix of start/end stages, with the following specifications:

It means that an anatomical entity could be present on several rows, for as many taxon-specific terms as it is merged with.

So we could imagine a file with the following columns: Anat. entity ID | Taxon ID (default would be metazoa) | Start object property | Start dev. stage ID | End object property | End dev. stage ID
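A minimal sketch of what such a file could look like, written as tab-separated values. The column names, the `write_stage_matrix` helper, and the choice of TSV are illustrative assumptions, not an agreed format; the IDs in the sample row are borrowed from the ureteric bud record shown later in this thread, and `NCBITaxon:9606` is the NCBI taxon ID for Homo sapiens.

```python
import csv
import io

# Hypothetical column names for the requested matrix.
COLUMNS = [
    "anat_entity_id",
    "taxon_id",
    "start_property",
    "start_stage_id",
    "end_property",
    "end_stage_id",
]


def write_stage_matrix(rows, handle):
    """Write one row per (anatomical entity, taxon) pair as TSV."""
    writer = csv.DictWriter(
        handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# One illustrative row; an entity may appear on several rows, one per taxon.
sample_rows = [
    {
        "anat_entity_id": "UBERON:0000084",
        "taxon_id": "NCBITaxon:9606",
        "start_property": "RO:0002496",
        "start_stage_id": "HsapDv:0000024",
        "end_property": "RO:0002497",
        "end_stage_id": "HsapDv:0000025",
    },
]

buffer = io.StringIO()
write_stage_matrix(sample_rows, buffer)
matrix_text = buffer.getvalue()
```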

Is it a reasonable request, or is it too much? :p

cmungall commented 10 years ago

OK, this is harder than I thought.

I'm doing this using OWL reasoning rather than a hacky xref-based mechanism. This will be more scalable in future, it can make use of taxon GCIs in uberon, etc.

One issue is that the composite ontologies are odd beasts. EHDAA2 is particularly challenging. I have fixed a lot of issues (for example, if something extends into adulthood, EHDAA2 will say 'ends at CS20'. The solution is to filter out anything that it says ends here).

But we still have oddities like:

   po EHDAA2:0000002 ! human embryo
    po EHDAA2:0001330 ! organ system group
     po EHDAA2:0001246 ! nervous system
      po EHDAA2:0000225 ! central nervous system
       po EHDAA2:0000183 ! brain *** 

This will create problems, as human brains will be inferred to end at the end of the embryo stage.

One solution is to treat each EHDAA2 class as equivalent to (UBERON:nnn and part_of some (embryo and part_of some Homo sapiens))

Another is to treat EHDAA2's 'human embryo' as being broader, i.e. an entire human.

cmungall commented 10 years ago

OK, here's another one:

In uberon, the pituitary is classified as a 'diencephalon gland' because the pituitary is considered a gland and part-of the diencephalon, with the parthood supported by ZFA and MA (I am not so sure, the ZFA def says it's ventral to the diencephalon).

Now in EHDAA2 the pituitary starts during CS12 and the 'diencephalon glands' (which are a mereological sum in EHDAA2) start at CS15

When you put this together you have the pituitary starting at both CS12 and CS15

Do you mind the report having logical inconsistencies like this?

This is an unavoidable consequence of going the logical route. It's powerful as it gives us a means to check consistency in a very strict way but some issues could take a lot of time and coordination to resolve.

The alternative is to just go with a more procedural route, e.g. use the direct assertions in EHDAA2 for human. I'm not sure I will have time to do this for your release.

cmungall commented 10 years ago

See the commit above for a first pass at a report, with some issues like the pituitary one. It's not hard to replicate this for other species and other sets of relationships.

The report should be interpreted as species specific. Thus if a row shows an UBERON ID in column 1, read this as the human-specific subtype.

The report uses 4 relations: 2 precise: starts during and ends during; and 2 bounding: starts during or after, ends during or before. In practice we rarely have anything for the last one.

The reasoning approach is powerful here - even though none of the cell types are annotated with stages, we can infer bounds based on develops-from and part-of relations (see RO for the property chain axioms).

So for example migratory cranial neural crest cell must start during or after CS11 in human (to get the full explanation it's necessary to use a reasoner environment like Protege). But it's presumably due to the developmental lineage leading to something that explicitly starts here.

I'm not sure why 'feather' and other non-human classes are showing up. It's saying that if feathers were to exist in humans they would have to start at CS17 or later. But this should be flagged as unsatisfiable and not reported. Hmm.

cmungall commented 10 years ago

Unfortunately ZFA is a problem here:

 / ZFA:0100000 ! zebrafish anatomical entity
  is_a ZFA:0000037 ! anatomical structure [start: "Zygote:1-cell"]
   is_a ZFA:0001477 ! portion of tissue [start: "Blastula:High"]
    is_a ZFA:0001122 ! primary germ layer [start: "Gastrula:75%-epiboly"]
     is_a ZFA:0000016 ! ectoderm [start: "Gastrula:75%-epiboly"]
      is_a ZFA:0005463 ! oral ectoderm ***  [start: "Hatching:Long-pec"]

Properties are inherited over subclass, so everything inherits the property of starting at zygote stage.

Another way around this is to interpret their start not as RO:0002488 ! existence starts during but instead as RO:0002496 ! existence starts during or after

(I think this may have been the result of a previous discussion about these relations in ZFA)

Same for end
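A small sketch of why the two readings behave differently on the ZFA excerpt above. It is plain Python rather than OWL reasoning; the stage ordering list and the assumption that these stages are mutually exclusive are mine, and the helper names are hypothetical.

```python
# Assumed temporal order of the stages in the excerpt (earliest first).
STAGE_ORDER = [
    "Zygote:1-cell",
    "Blastula:High",
    "Gastrula:75%-epiboly",
    "Hatching:Long-pec",
]
RANK = {stage: i for i, stage in enumerate(STAGE_ORDER)}

# is_a parents and asserted start stages, copied from the ZFA excerpt.
IS_A = {
    "ZFA:0005463": "ZFA:0000016",  # oral ectoderm -> ectoderm
    "ZFA:0000016": "ZFA:0001122",  # ectoderm -> primary germ layer
    "ZFA:0001122": "ZFA:0001477",  # primary germ layer -> portion of tissue
    "ZFA:0001477": "ZFA:0000037",  # portion of tissue -> anatomical structure
    "ZFA:0000037": "ZFA:0100000",  # anatomical structure -> root
}
START = {
    "ZFA:0000037": "Zygote:1-cell",
    "ZFA:0001477": "Blastula:High",
    "ZFA:0001122": "Gastrula:75%-epiboly",
    "ZFA:0000016": "Gastrula:75%-epiboly",
    "ZFA:0005463": "Hatching:Long-pec",
}


def inherited_starts(cls):
    """Collect every start stage asserted on cls or inherited over is_a."""
    stages = []
    while cls is not None:
        if cls in START:
            stages.append(START[cls])
        cls = IS_A.get(cls)
    return stages


def strict_reading_is_consistent(cls):
    """Under 'starts during', a class cannot start in two different stages."""
    return len(set(inherited_starts(cls))) <= 1


def weak_start_bound(cls):
    """Under 'starts during or after', the latest inherited stage is the tightest bound."""
    stages = inherited_starts(cls)
    return max(stages, key=RANK.__getitem__) if stages else None
```

With the strict reading, oral ectoderm inherits four different start stages; with the weak reading the same assertions are compatible and the tightest bound is the class's own stage.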

cmungall commented 10 years ago

EMAPA has the following partonomy:

 / EMAPA:25765 ! mouse [is_a: "EMAPA:0"] [is_a: "anatomical structure"]
  po EMAPA:16042 ! extraembryonic component [ends_at: "TS26"] [part_of: "EMAPA:25765"] [part_of: "mouse"] [starts_at: "TS04"]
   po EMAPA:16076 ! amniotic fold ***  [ends_at: "TS10"] [part_of: "EMAPA:16042"] [part_of: "extraembryonic component"] [starts_at: "TS10"]

In uberon 'amniotic fold' is a subclass of extraembryonic component, so the starts and ends are inherited, causing inconsistency. The EMAPA class means the entirety of the extraembryonic material. Subtle issues like these...

cmungall commented 10 years ago

OK, have a look at the reports directory on github. Useful despite inconsistencies?

Unfortunately producing versions for nematode and fly with the vert terms stripped out is going to be harder. The reasoner based strategy we use to make tax subsets is having out of memory errors (Elk, 10G mem). Need to investigate other strategy.

fbastian commented 10 years ago

Sorry for my late reply, and thank you for your efforts. Will have a careful look.

Meanwhile, do you want me to run the reasoner based strategy on our servers, just to see how much memory it would require? I can easily try with up to 60G.

dosumis commented 10 years ago

Wrote this last week but for some reason reply by email didn't work for github this time:

I think that ZFA's approach makes sense, just not with the current OWL translation. In fact, I don't see how the current OWL translation could possibly work for mixing a class hierarchy with existence relations for anatomical strucs.

You get safe inference if you reverse the relationship (some ectoderm starts_during all 'gastrula:75% epiboly'). I think it also makes sense to interpret the relation in the current direction as meaning 'starts during or after'.

(From recent commits it looks like you might be partially implementing this strategy)

fbastian commented 10 years ago

When you put this together you have the pituitary starting at both CS12 and CS15

and

Properties are inherited over subclass, so everything inherits the property of starting at zygote stage.

Could we have a simple filtering, like: when you have several possible stages for a boundary, consider the more granular one, or, if they are of same granularity, consider the one occurring earlier?

dosumis commented 10 years ago

Could we have a simple filtering, like: when you have several possible stages for a boundary, consider the more granular one, or, if they are of same granularity, consider the one occurring earlier?

"starts during or after" gives you that.

The inverted relationship (during all stage S, starts_to_exist some anatomical structure A) - means that at least one instance of A must start to exist during the stage specified.

cmungall commented 10 years ago

For ZFA, awaiting this: https://sourceforge.net/p/obo/zebrafish-anatomy-zfa-term-requests/125/

But will implement a hack for now

cmungall commented 10 years ago

OK, here's where we are. I have finished various ontology preprocessing hacks such that all external AO stage relationships are interpreted weakly (i.e. starts during or after, ends during or before).

This means that 2 of the columns (the stricter ones) in the report are almost always blank, so I may just remove them.

I also added a few trivial upper level axioms - e.g. every anatomical structure exists during or after zygote stage and during or before death.

One remaining issue.

here's a random example from the human report:

             ClassID: UBERON:0000084
          ClassLabel: ureteric bud
RO_0002488 existence starts during ID: 
RO_0002488 existence starts during Label: 
RO_0002492 existence ends during ID: 
RO_0002492 existence ends during Label: 
RO_0002496 existence starts during or after ID: HsapDv:0000024
RO_0002496 existence starts during or after Label: Carnegie stage 17 (human)
RO_0002497 existence ends during or before ID: HsapDv:0000025|UBERON:0000066|UBERON:0000071|HsapDv:0000022|HsapDv:0000021
RO_0002497 existence ends during or before Label: Carnegie stage 18 (human)|fully formed stage|death stage|Carnegie stage 15 (human)|Carnegie stage 14 (human)

the end stage inferences have some redundancies. We could filter these in an ad-hoc way, but I would rather do this using reasoning

The issue is that these are only formally redundant if we assert X1 precedes some X2 in the stage ontologies. This may seem redundant, as we have preceded_by. But in fact it's not (every larval stage succeeds an embryo stage but not every embryo stage is succeeded by a larval stage).

My proposed solution would be to assert reciprocals in species-specific stage ontologies, this is generally safe.

I will also assert some safe precedes relationships in Uberon - e.g. embryo stage precedes fully formed stage (one could argue that embryonic lethals invalidate this, but I would play the canonical card here)
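To illustrate the redundancy in the ureteric bud record above, here is an ad-hoc filter in Python, i.e. the non-reasoner route. The ordered list stands in for the precedes assertions that the stage ontologies would need; that ordering and the `tightest_end_bound` helper are assumptions of this sketch, not part of the report code.

```python
# Assumed precedes-order for the stages in the record (earliest first).
STAGE_ORDER = [
    "HsapDv:0000021",  # Carnegie stage 14 (human)
    "HsapDv:0000022",  # Carnegie stage 15 (human)
    "HsapDv:0000025",  # Carnegie stage 18 (human)
    "UBERON:0000066",  # fully formed stage
    "UBERON:0000071",  # death stage
]
RANK = {stage: i for i, stage in enumerate(STAGE_ORDER)}


def tightest_end_bound(stage_ids):
    """Among 'ends during or before' fillers, keep only the earliest stage.

    Ending by an earlier stage implies ending by every later stage,
    so the later fillers carry no extra information.
    """
    known = [s for s in stage_ids if s in RANK]
    return min(known, key=RANK.__getitem__) if known else None


# The pipe-separated value from the 'ends during or before' ID column.
raw = "HsapDv:0000025|UBERON:0000066|UBERON:0000071|HsapDv:0000022|HsapDv:0000021"
tightest = tightest_end_bound(raw.split("|"))
```

Stages missing from the ordering are ignored here rather than guessed at, which is one reason a reasoner-based filter would be preferable once the precedes axioms exist.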

cmungall commented 10 years ago

OK, bad news. I was getting some odd inferences for human, many things were inferred to end during the embryo stage. I tracked it down to:

[Term]
id: EHDAA2:0001330
name: organ system group
namespace: human_developmental_anatomy
is_a: CARO:0000011  ! anatomical system
relationship: ends_at CS20 ! CS20
***relationship: part_of EHDAA2:0000002 ! human embryo
relationship: starts_at CS06a ! CS06a

This is a known issue: in uberon we don't make a taxon equivalence axiom to this class (we use seeAlso), but you still get bad inferences when you follow axioms in EHDAA2.

I can fix this by forking the EHDAA2 github repo until there is a fix upstream.

However it's becoming apparent that a reasoner-based strategy is somewhat fragile here (at least when we use strong axioms like taxonomic equivalence). This is the best long term strategy but it will take a lot of inter-ontology coordination to get right.

For bgee you might want to explore something more ad-hoc: use generic stages for uberon, and for a particular species, do the lookup in the external ontology, but ignore all the relationships in this ontology?

cmungall commented 10 years ago

here's the explanation FYI

[Screenshot of the explanation, 2014-06-25 at 5:03:44 pm]

cmungall commented 10 years ago

I decided to make a fork of EHDAA2, it was just getting too hacky to have a script that transformed it. Made a pull request to see if they would accept: https://github.com/obophenotype/human-developmental-anatomy-ontology/pull/12

dosumis commented 10 years ago

Agree that asserting reciprocals in stage ontologies is the way to go. In fact, I already did this in the slice of FBdv used for reasoning about lethal phase. A simpler but more radical alternative would be to model stage series using individuals for life & all its stages. Pretty sure this would work. That way no reciprocals need be asserted, as long as a DL reasoner is used.

cerivs commented 10 years ago

I updated the ZFA xrefs in the SVN version of the ZFA like you requested. It will go into preversion next week. The start stage in ZFA means that under "normal" conditions the structure is evident at some time during the time period of the stage. If there is a discrepancy in the start time of a structure we choose the earlier start stage. If a structure develops from another structure we abut the end and start stages if the transformation occurs at the stage boundary and overlap the end and start stages if the transformation takes time, does not occur at a fairly precise time or if the time of the transformation. For annotations of developing structures if there is overlap curators choose the structure the researcher names. Curators follow the development tree to pick the right structure if the researcher uses sloppy terminology.

cmungall commented 10 years ago

David: I'm pretty gung-ho to try the individuals approach, as soon as Elk supports inferred object property assertions. Are there any species with forking paths though?

cmungall commented 10 years ago

Ceri: thanks for the clarification. Based on what you say these are indeed the correct relations to use.

As an aside, have you considered using transformation_of in ZFA? develops_from is fine, but sometimes it's useful to be more precise. It looks like there are 86 develops_from links in ZFA that temporally abut, e.g.

cmungall commented 10 years ago

See the 2014-06-26 release for the latest versions of the stage reports - in the /reports/ folder

fbastian commented 10 years ago

It looks great, thank you. 2 questions:

fbastian commented 8 years ago

Getting back to this issue, I have some questions:

cmungall commented 8 years ago

Is it correct that, if I wanted to generate the reports myself, I should use the code highlighted here: https://github.com/owlcollab/owltools/blob/master/OWLTools-Runner/src/main/java/owltools/cli/CommandRunner.java#L756, called from here: https://github.com/obophenotype/uberon/blob/master/Makefile#L286?

Yes. Note that the makefile will make some upstream targets, such as taxon-specific views

What are gciProperty and gciFillers supposed to do? Should I care about that?

You don't need to care here, because a taxon view has already been made. In these taxon views we make the assertion "everything is part of a fly" or similar, which has the effect of pushing-down taxon GCIs

It would be cleaner to do it in one step, by taking the complete composite-metazoan, running --export-parents with the gci property/filler to be part of some Insecta or similar, but there were efficiency issues

So I guess that the ontology source of reasoning should have been already filtered using the SpeciesSubsetterUtil for the proper taxon

Correct

Related to the SpeciesSubsetterUtil...

Can I get back to you on this one?

To clarify, what exactly does the relation existence starts during or after mean?

There is a bug in ontobee, docs should show here: http://purl.obolibrary.org/obo/RO_0002496

For now you have to look at ro.owl in protege.

http://purl.obolibrary.org/obo/RO_0002496 'existence starts during or after' x existence starts during or after y if and only if the time point at which x starts is after or equivalent to the time point at which y starts. Formally: x existence starts during or after y iff α (x) >= α (y).

http://purl.obolibrary.org/obo/RO_0002497 'existence ends during or before' x existence ends during or before y if and only if the time point at which x ends is before or equivalent to the time point at which y ends.

For instance, if I have a relation existence starts during or after CS17, it means that I have no guarantee that the structure actually exists at stage CS17, or CS18, but only the guarantee that it didn't exist before?

Correct

In the report, we sometimes have the end stage preceding the start stage, e.g.:

(THIS IS AN EDIT I MADE A MISTAKE IN MY ORIGINAL RESPONSE)

EHDAA2:0000845  interdigital region between fingers 1 and 2 epithelium (human)  HsapDv:0000024  CS17 (human)    HsapDv:0000023  Carnegie stage 16 (human)

This is biologically impossible. It's hard to get a reasoner to detect this with our class-level representation.

I'm trying to get an explanation in Protege but not having much luck...
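Outside a reasoner, rows like the one above can at least be flagged with a simple post-processing check. Using the definitions quoted earlier, a row is suspect when the stage named by 'starts during or after' comes later than the stage named by 'ends during or before'. The sketch below assumes a rank table for the two stages involved (here simply their Carnegie stage numbers); the `bounds_conflict` helper is hypothetical and not part of owltools.

```python
# Assumed ordering: rank each stage by its Carnegie stage number.
RANK = {
    "HsapDv:0000023": 16,  # Carnegie stage 16 (human)
    "HsapDv:0000024": 17,  # Carnegie stage 17 (human)
}


def bounds_conflict(start_stage, end_stage, rank=RANK):
    """True when the start bound names a later stage than the end bound."""
    return rank[start_stage] > rank[end_stage]


# (class ID, 'starts during or after' stage, 'ends during or before' stage)
row = ("EHDAA2:0000845", "HsapDv:0000024", "HsapDv:0000023")
is_suspect = bounds_conflict(row[1], row[2])
```

This only detects the conflict; finding which axioms cause it still needs the reasoner explanation.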

cmungall commented 8 years ago

My strategy is to do a DL query

'existence starts during or after' some 'Carnegie stage 17 (human)' and 'existence ends during or before' some 'Carnegie stage 16 (human)'

and click the ? for an explanation

but it's crashing Protege/ELK

fbastian commented 8 years ago

Hmm, an "execute query" gives me owl:Nothing as a result. But I did it on a human subset produced from the composite-metazoan, so with FBbt classes, etc :p Could you please attach your source ontology?

These 'or after' and 'or before' are frustrating for me. I mentioned once to you my use case, to propagate calls of gene expression. E.g., If a gene is expressed in brain, we want to infer that it is expressed in all parent structures: nervous system, etc. But if we have a condition "brain" - "adult", then we will infer that the gene is expressed in, e.g., "embryo" at stage "adult", because "brain" is part of "embryo" at some point. We currently have such invalid conditions showing up in our analyses. I could do a filtering based on our annotations, but I would lose most of the richness of Uberon.

With 'or after' and 'or before', it means that I will still potentially propagate to invalid conditions. Do you think this yields lots of incorrectness, or do the relations most of the time point to the correct start stage? Maybe it is acceptable to treat these relations as "starts/ends at"...

cmungall commented 8 years ago

On 7 Oct 2015, at 13:27, fbastian wrote:

Hmm, an "execute query" gives me owl:Nothing as a result. But I did it on a human subset produced from the composite-metazoan, so with FBbt classes, etc :p Could you please attach your source ontology?

https://www.dropbox.com/sh/fdxwsdwl6bofamo/AABbx8xrv_yYtp-GD4VXKsRGa?dl=0

These 'or after' and 'or before' are frustrating for me. I mentioned once to you my use case,

Sorry I may not remember the exact details :-(

I think I have the gist of it

to propagate calls of gene expression. E.g., If a gene is expressed in brain, we want to infer that it is expressed in all parent structures: nervous system, etc. But if we have a condition "brain" - "adult", then we will infer that the gene is expressed in, e.g., "embryo" at stage "adult", because "brain" is part of "embryo" at some point. We currently have such invalid conditions showing up in our analyses. I could do a filtering based on our annotations, but I would lose most of the richness of Uberon.

I don't think you should get this inference: is this your own procedure or using an owl reasoner?

(or are you referring to the EHDAA2 issue, above? We can have a separate solution for this)

With 'or after' and 'or before', it means that I will still potentially propagate to invalid conditions. Do you think this yields lots of incorrectness, or do the relations most of the time point to the correct start stage? Maybe it is acceptable to treat these relations as "starts/ends at"...

It depends how you mean 'treats'. If it's just labeling a column, it may not matter. I think I need to understand more of how you are using this in propagation

Remember, there are a lot of lines like this:

CL:0000315      tears secreting cell    UBERON:0000106  zygote stage    UBERON:0000071  death stage
UBERON:0035784  seminal clot    UBERON:0000106  zygote stage    UBERON:0000071  death stage
UBERON:0014393  sweat of axilla UBERON:0000106  zygote stage    UBERON:0000071  death stage

You have to treat these as being inclusive of down and upstream, otherwise they are plain wrong, because zygotes don't have sweaty pits!

Remember also the ZFA issue above.

One option is just to throw OWL reasoning out the window and implement some ad-hoc procedure. For example, you could say that for ZFA you will treat the relations as the more specific ones, but you will allow overrides over the is_a hierarchy. Basically you just propagate the most specific stage down, and tolerate some error.
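One way to read the ad-hoc procedure just described is as a nearest-ancestor lookup: a class uses its own asserted stage if it has one, and otherwise takes the stage of the closest is_a ancestor that does. The sketch below assumes single inheritance for brevity; `EX:0000001` is a hypothetical unannotated class added for illustration, while the ZFA IDs and stages come from the excerpt earlier in the thread.

```python
# is_a parents (single inheritance assumed for this sketch).
IS_A = {
    "EX:0000001": "ZFA:0005463",   # hypothetical unannotated subclass of oral ectoderm
    "ZFA:0005463": "ZFA:0000016",  # oral ectoderm -> ectoderm
    "ZFA:0000016": "ZFA:0001122",  # ectoderm -> primary germ layer
}

# Start stages asserted directly on each class.
ASSERTED_START = {
    "ZFA:0001122": "Gastrula:75%-epiboly",
    "ZFA:0000016": "Gastrula:75%-epiboly",
    "ZFA:0005463": "Hatching:Long-pec",
}


def effective_start(cls, is_a=IS_A, asserted=ASSERTED_START):
    """Return the start stage asserted on cls, else on its nearest is_a ancestor."""
    seen = set()
    while cls is not None and cls not in seen:
        seen.add(cls)  # guard against accidental cycles in the input
        if cls in asserted:
            return asserted[cls]
        cls = is_a.get(cls)
    return None
```

The more specific assertion on oral ectoderm overrides the one on ectoderm, and the unannotated subclass picks up the oral ectoderm stage. With multiple inheritance a tie-breaking rule would be needed, which is where the tolerated error comes in.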

I think if we do follow a reasoning approach we have to stop the bad inferences when we cross ontologies, when the ssAO is not quite an exact subclass (see this ticket above for examples). One way is to refine the bridging axioms, but you use a lot there. Another approach is more curation-heavy: we take control over the ss start and end relations, we curate them using precise starts-during and ends-during axioms (using large bins where appropriate of course). We could seed these from existing ontologies, but we would soon diverge. The file could be a separate one that bgee curators edit (either table or an owl bridge file to be edited in Protege). This might be a lot of work, and we would be somewhat duplicative of our friends curating the ssAOs. But if we don't do this there will simply be too much leakage for a pure logic approach to work, at least in the vertebrates.

fbastian commented 8 years ago

I don't think you should get this inference: is this your own procedure or using an owl reasoner?

Woops, just walking the graph without taking "existence" relations into account, I didn't think of it :/ I cannot use a reasoner on-the-fly for my application, I need to extract relevant information beforehand, and obviously I did it wrong.

OK, so you're telling me that I could use a reasoner to generate subsets valid at specific stages? Please show me a code snippet!

But you're also telling me that I will still have lots of false positives, and incorrect propagation of my data, right? We already discussed in our group the need for curating start/end stages in Uberon, and we would be willing to do it in mid/late 2016, if this is what it takes to propagate our data correctly (if we're still around here ofc ^^)

edit : I have an OutOfMemoryError when doing the DL query on your ontology :/

cmungall commented 8 years ago

OK, so you're telling me that I could use a reasoner to generate subsets valid at specific stages? Please show me a code snippet!

Did I say that? ...

So what you want is something like the existing report, but with intermediate stages (and superstages) filled in?

But you're also telling me that I will still have lots of false positives, and incorrect propagation of my data, right?

Umm, not sure exactly which option we're referring to here..

We already discussed in our group the need for curating start/end stages in Uberon, and we would be willing to do it in mid/late 2016, if this is what it takes to propagate our data correctly (if we're still around here ofc ^^)

You better be! Well I wouldn't want to make you do busy work. I think we can work something out that involves some precision curation and automating the rest

fbastian commented 8 years ago

Did I say that? ...

So what you want is something like the existing report, but with intermediate stages (and superstages) filled in?

Yes, but I think I can fill the gaps easily from preceding relations, etc. So, no "reasoner magic" on this one? ^^

OK, I realize that none of your examples is a problem for our expression propagation, because they have no substructures. So, I'm clarifying my use case:

My problem is when a structure continues to exist, while some of its superclasses are supposed to disappear at some point, but we don't know when; or, a structure starts to exist before some of its superclasses, but again, we don't know when the superclass appears. A gross example would be: brain part_of embryo and brain part_of adult organism; embryo and adult organism having incorrect start/end stages (zygote to death stage). If I have a call of expression in brain at stage adult => gene notably considered expressed in embryo at stage adult; call of expression in brain at stage embryo => gene considered expressed in adult organism at stage embryo.

Conclusion: incorrect start/end stages are not an issue for structures that have no subclasses crossing their stage existence boundaries, i.e., all substructures exist only during the existence of their superclass (because, obviously, we wouldn't have an annotation on a structure at a stage it doesn't exist at, to be incorrectly propagated to its superclass).
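A toy version of the propagation described above, showing how correct existence windows would block the invalid conditions. Everything in it is illustrative: the two-stage timeline, the existence windows, and the `propagate` helper are assumptions of this sketch, not Bgee's actual pipeline or data.

```python
# Hypothetical two-stage timeline, indexed in temporal order.
STAGE_INDEX = {"embryo": 0, "adult": 1}

# part_of parents of each structure.
PART_OF = {"brain": ["embryo", "adult organism"]}

# structure -> (first stage index, last stage index), inclusive.
EXISTS = {
    "brain": (0, 1),
    "embryo": (0, 0),
    "adult organism": (1, 1),
}


def propagate(structure, stage):
    """Return (parent, stage) conditions only for parents that exist at that stage."""
    idx = STAGE_INDEX[stage]
    conditions = []
    for parent in PART_OF.get(structure, []):
        first, last = EXISTS[parent]
        if first <= idx <= last:
            conditions.append((parent, stage))
    return conditions
```

If the windows for embryo and adult organism were both widened to the whole timeline, as with the zygote-to-death defaults, the filter would let every condition through, which is the failure mode described above.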

Another reason for our incorrect propagated calls is the problem of annotating gestational structure to the mother's age (BgeeDB/expression-annotations/issues/42). We have annotations to "placenta" or "decidua" at stage "adult", thus propagating to "embryo" at stage "adult". For those it is always bad to have incorrect start/end stages for the superclasses, but it is more a problem with how we annotate these structures than a problem with Uberon. But still, correct start/end stages would help.

This being said, it seems like your start/end stages are pretty correct regarding structures having some subclasses crossing their stage boundaries. What is your opinion? Should I just go for it and hope for the best? :p

(On a side note, we would be interested in getting correct start/end stages in Uberon anyway, because we propagate some information from superclass to subclasses, and then we're getting into a lot of troubles... ;) )

fbastian commented 8 years ago

Sorry, but I need more info :p

gouttegd commented 3 years ago

WARNING: This issue has been automatically closed because it has not been updated in more than 3 years. Please re-open it if you still need this to be addressed – we are now getting some resources to deal with such issues.