monarch-initiative / phenol

phenol: Phenotype ontology library
https://phenol.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
23 stars 4 forks source link

environmental exposure obo parsing exception #163

Closed kingmanzhang closed 5 years ago

kingmanzhang commented 5 years ago

org.monarchinitiative.phenol.base.PhenolException: Did not recognize RelationshipType: http://purl.obolibrary.org/obo/RO_0000087

at org.monarchinitiative.phenol.ontology.data.RelationshipType.fromString(RelationshipType.java:103)
at org.monarchinitiative.phenol.io.obo.OboOntologyLoader.convertEdgesToRelationships(OboOntologyLoader.java:265)
at org.monarchinitiative.phenol.io.obo.OboOntologyLoader.load(OboOntologyLoader.java:97)
at org.monarchinitiative.phenotefx.io.EctoOboParser.parse(EctoOboParser.java:31)
at org.monarchinitiative.phenotefx.io.EctoOboParserTest.parse(EctoOboParserTest.java:15)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
at com.intellij.rt.execution.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:47)
at com.intellij.rt.execution.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:242)
at com.intellij.rt.execution.junit.JUnitStarter.main(JUnitStarter.java:70)
kingmanzhang commented 5 years ago

ecto.obo:

https://raw.githubusercontent.com/EnvironmentOntology/environmental-exposure-ontology/master/ecto.obo

pnrobinson commented 5 years ago

Currently phenol hard codes relation types with the intent of doing some minimum Q/C, so that all of the relation types are at least in principle computable (the alternative would be to create a relation on the fly or to map all relations to some generic relation). The disadvantage is that phenol crashes if it gets a relation it has never seen. For now, we can just add this relation, but perhaps there is a better strategy moving forward.

kingmanzhang commented 5 years ago

Okay, I can add it tomorrow.

cmungall commented 5 years ago

I would separate QC from parsing. Harcoding a list of expected relations will be too fragile

On 10 Dec 2018, at 23:48, Peter Robinson wrote:

Currently phenol hard codes relation types with the intent of doing some minimum Q/C, so that all of the relation types are at least in principle computable (the alternative would be to create a relation on the fly or to map all relations to some generic relation). The disadvantage is that phenol crashes if it gets a relation it has never seen. For now, we can just add this relation, but perhaps there is a better strategy moving forward.

-- You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub: https://github.com/monarch-initiative/phenol/issues/163#issuecomment-446003855

pnrobinson commented 5 years ago

@cmungall I agree. Do you think it would be a better plan to essentially store a few core relations such as IsA during parsing and to create a map of additional relations? These could then be Q/C'd by classes such as HpoOntologyParser, MondoOntologyParser etc (or perhaps even better HpoOntology, MondoOntology etc).

kingmanzhang commented 5 years ago

Well, there are many relationship types in ecto not captured in the current enum class. It does not sound a good plan to add them one by one, so I will wait for a more permanent fix is made.

cmungall commented 5 years ago

How about a configuration object that allows specification of which relations phenol cares about, with defaults set to SubClassOf (not an object property, but an OWL predicate), part of, and we can add others. After the initial parse a filter could be applied.

I recommend removing all enums and making this more configuration-driven.

pnrobinson commented 5 years ago

Chris, so do you mean the configuration object would be passed to the parser? What would the desired behaviour be?

  1. discard relations that are not mentioned in the config?
  2. alternatively, throw an exception?
  3. alternatively, map unknown relations to some generic relation type?

My preference would be (2) but to allow the configuration object to do 1,2, or 3 depending on the config. The classes such as HpoParser would have default Config objects that would do the right thing in 99% of cases, but power users could roll their own by using an optional constructor like

HpoParser(String/File/Stream input, OboParseConfig config)

Is this roughly what you are suggesting?

cmungall commented 5 years ago

I think this makes sense, I would make 1 the default as you want to retain the ability to add new relationship types without making new code releases. I'd also abstract above an OboParseConfig to something like an OntologyConfig

pnrobinson commented 5 years ago

Currently, what happens is that the OboOntologyLoader handles Relations like this:

RelationshipType reltype = RelationshipType.fromString(edge.getPred());
Relationship relationship = new Relationship(subjectTermId, objectTermId, edgeId, reltype);
relationships.add(relationship);

TheRelationshipType enum looks like this

public enum RelationshipType {
  /** "Is-a" relation. */
  IS_A("is_a"),
  /** "Intersection-of" relation. */
  INTERSECTION_OF("intersection_of"),
  /** "Union-of" relation. */
  UNION_OF("union_of"),
(...etc for currently about 25 relations...)

and

static public RelationshipType fromString(String reltype) throws PhenolException  {
    switch (reltype) {
      case "is_a":
        return IS_A;
      case "http://purl.obolibrary.org/obo/BFO_0000050":
      case "part of":
      case "part_of":
        return PART_OF;
      case "http://purl.obolibrary.org/obo/BFO_0000051":
      case "has part":
        return HAS_PART;
      case "http://purl.obolibrary.org/obo/RO_0002211":
      case "regulates":
        return REGULATES;
(...)

If we want to do this with a Config object that will be good for previously unknown relationship types, we will probably need to refactor this design pattern--we can/should not subclass an enum for instance.

The goal is to have a computable representation of the relations (e.g., annotation propagation for IS_A but not for REGULATES). This will need some thought.

cmungall commented 5 years ago

I don't understand the enum, I think this is a hangover from the original ontolib code, looks like it's based on a big misunderstanding of obo format, intersection_of is not a relation, neither is union_of. These are OWL constructs for building up class expressions (I noted this back in #9).

I would definitely remove a lot of the hardcoding. The actual patterns used depend on how you see graph operations done in Phenol. I can sketch out two overall patterns, the first is how we approach things in ontobio, the second hardcodes a bit more and I think is more what your aiming for.

Ontobio approach

(I say ontobio, but at a general level this is the approach taken in the SciGraph neo4j mapping, in OLS, in tons of other libs...)

The goal is to create a generic Graph object (here networkx, but could be any) where the edges are labeled with relation IDs. The relation IDs are also nodes in the graph and can have metadata such as label (e.g. "part of"); but he graph operations largely ignore what the edge label is, if you want ancestors, then it uses all edges by default. The user can choose to create subgraphs based on a set of relationship types - e.g. I may want to use subClassOf and BFO:0000050 in an anatomy graph. There is virtually no hardcoding of any knowledge about relationship types in the core code, but application specific code may have methods to create e.g. a GO subgraph that preserves subClassOf/part of/regulates.

This provides open-ended capabilities for any annotations with any ontologies but gives the user a lot of rope to hang themselves. Someone may naively parse some OWL from a highly expressive ontology into a graph and naively do graph operations, and end up making hairballs and doing unintentional things like propagating a disease to a CHEBI role based on HPO logical definitions. Instead such a use should either take a "basic" version of an ontology (we provide these on the OBO site for a number of ontologies), or just use provided graph filtering methods.

E.g. for many HPO operations people will want to just use the subClassOf graph (though we want to avoid hardcoding this too much if we are to ever move away from is-a overloading, cc @dosumis).

We can also provide some ready made helper static classes, e.g "makeMeAFoolproofHPOGraphFromOboOrOwl", this is what I had in mind with the configs, this would just use the generic file->graph methods and then trim certain edges (and nodes) out the graph.

This kind of wild-west graph theoretic approach that ignores OWL semantics and RO semantics is anathema to my "OWL head" but this is the dominant paradigm in bioontologies and it is pretty useful to be able to do things with arbitrary ontologies using a graph datamodel, and not have the library have any built in preconceptions about what you should and shouldn't do.

If phenol follows this approach then parsers don't even need a config object with relation IDs parsed in, this is treated as a separate concern and can be dealt with by a subgraph library.

Manifesting some semantics at the java level

Phenol may desire to trade some flexibility to provide baked in semantics of some of the used ontologies at the java class model level. This kind of seems what Phenol is doing, but I'm not sure how much of that is by current intent and how much is legacy of the original inherited code, where we had lots of built-in assumptions about ontologies (#7).

I can see some merit in baking in some assumptions e.g. about subClassOf, and about how diseases may propagate up the HPO graph. But TBH this is not my natural mode of thinking and I give advice it may end up as a weird chimera. @julesjacobsen what are your thoughts here?

OWL-centric approach

I mention for completeness, but another approach is to stay purely in the OWL universe and only use sanctioned entailed SubClassOfs, making use of class expressions as necessary, but I think this makes the lib pretty hard for an average bioinformatics person to use.

pnrobinson commented 5 years ago

@kingmanzhang I think that this may require a bit of thinking. Can you post the ECTO relations here so that we can figure out a way forward? I do not think we want to make phenol to be a general OBO or OWL library -- it is basically a tool for our projects and anybody else who thinks it useful. Currently this works for HP and MP, and we should maybe use this as a test case to see if we can easily extend the ontologies we need for CD project.

kingmanzhang commented 5 years ago

I don't have a list. You can use protege to open the ECTO file and read the relationship classes. There are A LOT classes, most are absent from phenol.

pnrobinson commented 5 years ago

I just had a closer look and I think that we do not need to keep the ChEBI etc classes, but instead just the ECTO classes (e.g., id: ECTO:0000360) and as far as I can see, all of these classes just have is_a relationships. The other relationships are used in the logical definitions e.g.

id: ECTO:0000360
(...)
intersection_of: ExO:0000002 ! exposure event
intersection_of: RO:0002309 CHEBI:3015 ! has exposure stimulus benomyl

(the "intersection_of) lines are OBO's way of representing OWL definitions. For the disease definitions themselves, we will rely on just the pure ECTO terms, and so I would suggest that we implement an ECTO parser that discards everything except the ECTO terms, and this should also mean that we only have is_a relations to worry about.

cmungall commented 5 years ago

+1 to Peter's comments

pnrobinson commented 5 years ago

@cmungall @julesjacobsen @kingmanzhang The issue is that currently the OWLOntologyLoader needs to have all of the relations in order to parse. We could

  1. Introduce a boolean option for "tolerant" parsing (this would skip all other relations while parsing ECTO)
  2. Subclass the OntologyParser/Loader for ECTO to do the right thing
  3. Add the 7 relations for now and figure things out later

@julesjacobsen Are you currently working on these classes? Do you see an easy solution? What about solutiuon #1 for now? Should we code this or would this conflict with what you are doing?

cmungall commented 5 years ago

I think the tolerant parsing is the way to go

julesjacobsen commented 5 years ago

@pnrobinson Yes, working on this right now! Hold-on for a while and I'll merge my changes into development and then take a look at this issue. There will be horrible conflicts if you forge ahead without waiting.

pnrobinson commented 5 years ago

we are baiting with weighted breath, no worries!

julesjacobsen commented 5 years ago

There are two options right now:

The simplest one is to replace the exception with the default RelationshipType.UNKNOWN this is great as it fixes loading the ontology, but I've no idea how sensible the output is. This could be tweaked a bit so that with this type relationships are removed, but again, I've no idea what the resulting graphs will look like.

Alternatively, would it work if we had a set of known TermId instead of an Enum? We could still have a set of constants for known and loved relationships e.g. PART_OF = TermId.of("BFO:0000050") and unknown ones would still be present creating a valid graph as originally imported. But there seems to be relationships like is_a which have no id, which seems wrong.

pnrobinson commented 5 years ago

@julesjacobsen I suspect that for the phenotype ontologies, there will not be many relatons outside of is_a that we will care about in phenol. Parsing MONDO and NCIT is already slight scope creep (but convenient). For ECTO, we could use UNKNOWN to avoid crashing, and then remove all terms that are not ECTO, and all remaining relations would be ISA. I do not forsee that we will want to parse many other ontologies (this week, haha), and so with that our acute issues should be solved ?!?

dosumis commented 5 years ago

On Tue, Jan 8, 2019, 5:30 PM Peter Robinson notifications@github.com wrote:

@julesjacobsen https://github.com/julesjacobsen I suspect that for the phenotype ontologies, there will not be many relatons outside of is_a that we will care about in phenol.

Don't know enough about Phenol to comment specifically on that, but as part of uPheno work we will be pushing strongly to reduce is_a overloading. This will likely result in the replacement of some is_a links with causal relationships.

Parsing MONDO and NCIT is already slight scope creep (but convenient).

For ECTO, we could use UNKNOWN to avoid crashing, and then remove all terms that are not ECTO, and all remaining relations would be ISA. I do not forsee that we will want to parse many other ontologies (this week, haha), and so with that our acute issues should be solved ?!?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/monarch-initiative/phenol/issues/163#issuecomment-452383774, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG4x315gzhwdBj06WHnwXtOYqT6GrNoks5vBNWqgaJpZM4ZMR-a .

pnrobinson commented 5 years ago

I am not sure I am on board with adding causal relationships between phenotypes? This issue is not the place for it but can you present these plans at some future monarch call?

dosumis commented 5 years ago

On 8 Jan 2019, at 18:36, Peter Robinson notifications@github.com wrote:

I am not sure I am on board with adding causal relationships between phenotypes? This issue is not the place for it but can you present these plans at some future monarch call?

Sure. It’s an issue we can’t avoid though. The incorrect use of is_a as a generic grouping relation in phenotype ontologies is a serious source of bogus inferences. We’re partially insulated from these by hacks in our design patterns, but this kind of fix is not ideal.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/monarch-initiative/phenol/issues/163#issuecomment-452405213, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG4x1jMdbvyueJ0Ves8Qjqo-kHlmyo7ks5vBOU_gaJpZM4ZMR-a.

cmungall commented 5 years ago

This is the discussion here: https://github.com/obophenotype/upheno/issues/291

julesjacobsen commented 5 years ago

@kingmanzhang Are you writing a new ECTO parser?

julesjacobsen commented 5 years ago

@pnrobinson As a general solution would it work if the parser supplied a set of allowed curie prefixes to the OboGraphDocumentAdaptor e.g. GO, HPO, MP, ECTO and always removed RelationshipType.UNKNOWN?

Currently all RO and BFO terms are removed by default in the OboGraphDocumentAdaptor

https://github.com/monarch-initiative/phenol/blob/29ce11d2481ea1d3a8ae2ee73f74c5410f42bc68/phenol-io/src/main/java/org/monarchinitiative/phenol/io/obo/OboGraphDocumentAdaptor.java#L136-L138

but instead, if this only allowed expected curies through it would be far easier for the parsers to request what they wanted.

pnrobinson commented 5 years ago

I see -- so the parser would be like this?

public MyOntologyParser() {
  Set<String> acceptableTermPrefixes=ImmutableSetof("ECTO");
  List<TermId> acceptableRelations=ImmutableSet.of(TermId.of("RO:3")); // plus IS_A, PART_OF by default

....

}

// another constructor where client can change prefixes...
MyOntologyParser(List<String>,Set<TermId>) {
//...
}
julesjacobsen commented 5 years ago

I'm suggesting, for the time being at least, to keep the RelationshipType enum but always return RelationshipType.UNKNOWN instead of throwing an exception. However, if we only ever allow expected curies through it these relations would never be created anyway.

So a parser might look like something like this:

public GoOntology parse() throws PhenolException {
    OntologyLoader loader = new OntologyLoader(obo);
    Ontology ontology = loader.loadOntologyWithPrefixes("GO");
    if (debug) {
      System.err.println(String.format("Parsed a total of %d GO terms",ontology.countAllTerms()));
    }

    return new GoOntology(
      (ImmutableSortedMap<String, String>) ontology.getMetaInfo(),
      ontology.getGraph(),
      ontology.getRootTermId(),
      ontology.getNonObsoleteTermIds(),
      ontology.getObsoleteTermIds(),
      (ImmutableMap<TermId, Term>) ontology.getTermMap(),
      (ImmutableMap<Integer, Relationship>) ontology.getRelationMap());
  }
pnrobinson commented 5 years ago

this seems like a reasonable step! We can extend it later in a similar way for Relations if there is a need to?

julesjacobsen commented 5 years ago

Can I also suggest we get rid of mostly useless classes like GoOntology and MpOntology as these add nothing to ImmutableOntology apart from giving them a generic type. There is no behavioral difference and the MpOboParser, apart from being unused doesn't even return an MpOntology. These are all wasted classes. The only one with any difference is the HpoOntology, which arguably shoudn't exist either as there is only the getPhenotypicAbnormalitySubOntology() method used in the PrecomputScoresCommand. Should this not be a general operation which can be called if required?

Then we can load any ontology by simply calling:

OntologyLoader loader = new OntologyLoader(obo);
Ontology goOntology = loader.loadOntologyWithPrefixes("GO");
Ontology hpoOntology = loader.loadOntologyWithPrefixes("HPO");
Ontology ectoOntology = loader.loadOntologyWithPrefixes("ECTO");
Ontology mergedOntology = loader.loadOntologyWithPrefixes("HPO", "MP", "UBERPHENO");
//  etc...
pnrobinson commented 5 years ago

I agree that we do not need these classes. We could also replace getPhenotypicAbnormalitySubOntology() with

TermId PHENOTYPIC_ABNORMALITY = TermId.of("HP:0000118");
Ontology phenoOntology = ontology.getSubontology(PHENOTYPIC_ABNORMALITY);

or something like that. Or maybe something like

Ontology parseFromRoot(TermId subontologyRoot) {
 // just return terms that descent from subontologyRoot
}
julesjacobsen commented 5 years ago

Excellent - in the interests of not biting off too much in one go I'll go with the first option as this is a minimal change.

kingmanzhang commented 5 years ago

Are you writing a new ECTO parser? @julesjacobsen No, feel free to create it. If you want me to do it, shoot me a message when the refracting of other ontologies and parsers are completed so that I can be consistent.

julesjacobsen commented 5 years ago

Actually I didn't want to create one as I've removed all the ontology typed parsers.

I've refactored things so that now you can load any ontology like this:

File ectoFile ...
// Root term id is  XCO:0000000
Ontology permissiveOntology = OntologyLoader.loadOntology(ectoFile);

// Throws exception - No root term found
Ontology ectoOnly = OntologyLoader.loadOntology(ectoFile, "ECTO");

// Root term id is UBERON:0000000
Ontology ectoWithUberonRoot = OntologyLoader.loadOntology(ectoFile, "ECTO", "UBERON");

// Root id is artificial term RO:0000000
Ontology ectoWithRoRoot = OntologyLoader.loadOntology(ectoFile, "ECTO", "RO");

The only issue is that, with ECTO at least calling OntologyLoader.loadOntology(ectoFile, "ECTO"); results in an error as there are no ECTO root candidates.

@pnrobinson If this is acceptable then I think I'm finished.

cmungall commented 5 years ago

The only issue is that, with ECTO at least calling OntologyLoader.loadOntology(ectoFile, "ECTO"); results in an error as there are no ECTO root candidates

There's a previous ticket about this but we should be precise when using concepts like root. Currently the root node in ECTO is an ExO class. I think it may be more straightforward if we claim an ECTO class here. Nevertheless, we shouldn't assume the root node in an ontology is always in the ID space of that ontology.

julesjacobsen commented 5 years ago

Looking at this a bit more it's actually an issue caused by the lack of an ECTO mapping in the curie_map.yaml. Once this is manually hacked-in the graph ends up as expected - only ECTO terms connected by IS_A relationships and an artificial ECTO:0000001 root.

Loading the graph in a permissive manner results in the correct ExO root.

julesjacobsen commented 5 years ago

@cmungall I'm not really sure what the eventual aim here is, but the current implementation will look for the first term with no incoming relationships and call that the root. If there are multiple roots a fake one is made to fit the assumption there can only be one root node and this is in the same ID space as the others.

julesjacobsen commented 5 years ago

This is the new implementation for loading ontologies, if you like the look of this I'll push it. It's a big change as the generic typed ontologies and parsers have all gone and are replaced with this:

// The HPO is in the default  curie map and only contains known relationships / HP terms
Ontology hpoOntology = OntologyLoader.loadOntology(hpoFile)

// The GO is in the default curie map but also contains BFO and RO terms with unknown relationships
// we want to ignore these so here we specify the term prefixes we want to use. It has three possible root
// nodes (biological_process, cellular_component, biological_function) so an artificial GO:0000000 root is
// added
Ontology goOntology = OntologyLoader.loadOntology(goFile, "GO")

// ECTO isn't mapped in the default Curie mappings, so we need to add it here (the PURL isn't correct)
CurieUtil curieUtil = CurieUtilBuilder.withDefaultsAnd(ImmutableMap.of("ECTO", http://purl.obolibrary.org/obo/ECTO_"));

// ECTO also contains a bunch of unknown relationships so we're going to simplify this graph by only
// loading ECTO nodes (this ignores the true root term XCO:0000000) and other nodes from CHEBI,
// BFO and UBERON among others.
Ontology ecto = OntologyLoader.loadOntology(ectoFile, curieUtil, "ECTO");

ecto.getRelationMap().values().forEach(relationship -> assertEquals(RelationshipType.IS_A, relationship.getRelationshipType()));

assertEquals(TermId.of("ECTO:0000000"), ecto.getRootTermId());
assertEquals(2271, ecto.countNonObsoleteTerms());
assertEquals(0, ecto.countObsoleteTerms());
pnrobinson commented 5 years ago

Looks great to me!

cmungall commented 5 years ago

Awesome!

I'd recommend against injecting IDs, for all you know GO:000000 may clobber an existing class. If you need a take root, use owl:Thing or invent something that won't clash.

julesjacobsen commented 5 years ago

@cmungall nice suggestion - in order to keep things OBOish and in the same idspace would IDSPACE:Root work e.g. ECTO:Root or could this cause issues? For the moment though I've used owl:Thing.

julesjacobsen commented 5 years ago

see e7531dadf4e732f4f90969b27cf8be603e1ed936 and 7e09baa

kingmanzhang commented 5 years ago

The new parser is working now. Thanks to everyone, especially @julesjacobsen ! @pnrobinson I have updated PhenoteFX. Other apps that depend on phenol also need refactoring to use the general ontology and loader class.

pnrobinson commented 5 years ago

sounds great! I will start to refactor the other apps!