monarch-initiative / phenio

An integrated ontology for Phenomics
https://monarch-initiative.github.io/phenio/
BSD 3-Clause "New" or "Revised" License

Create an include or exclude list for properties #27

Open matentzn opened 1 year ago

matentzn commented 1 year ago

Phenio contains some relationships, such as http://purl.obolibrary.org/obo/emapa#is_a, which are really confusing. These should be removed prior to release.
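
A minimal sketch of what an exclude list could look like, assuming edges are plain (subject, predicate, object) triples of IRI strings. The `PropertyFilter` class, the `Edge` record, and the single entry in `EXCLUDED` are hypothetical illustrations, not part of the phenio build.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch: drop edges whose predicate is on an exclude list.
class PropertyFilter {

    // Illustrative triple of IRI strings; this is not the obographs Edge class.
    record Edge(String sub, String pred, String obj) {}

    // Example entry only; a real list would be curated before each release.
    static final Set<String> EXCLUDED = Set.of(
            "http://purl.obolibrary.org/obo/emapa#is_a");

    // Keep every edge whose predicate is not in the exclude list.
    static List<Edge> removeExcluded(List<Edge> edges, Set<String> excluded) {
        return edges.stream()
                .filter(edge -> !excluded.contains(edge.pred()))
                .collect(Collectors.toList());
    }
}
```

An include list would be the same filter with the condition inverted, which is stricter because any predicate not explicitly listed is dropped.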

julesjacobsen commented 1 year ago

So, what is the correct URI for is_a? I couldn't find anything in OLS.

matentzn commented 1 year ago

rdfs:subClassOf

julesjacobsen commented 1 year ago

So, would rdfs:subClassOf be considered the CURIE for https://www.w3.org/TR/rdf-schema/#ch_subclassof once expanded, with is_a being a synonym? Just wondering how this fits with the Node model in obographs.

matentzn commented 1 year ago

I think rdfs:subClassOf is considered a "built-in" and would probably not be represented in obographs as a node at all. It's a good question though; the distinction is somewhat arbitrary. Why do you need to know a CURIE/IRI for is_a in obographs? OAK has an obographs-to-OWL mapping that handles all of this expansion.

julesjacobsen commented 1 year ago

This is indeed how it is (once http://purl.obolibrary.org/obo/emapa#is_a is removed): there is zero mention of the source of 'is_a', yet it is the most commonly referenced predicate. Practically all other predicates used in an Edge are URIs declared with their label as a Node (either as a CLASS or PROPERTY type), so if you're doing the naive thing of using URIs to look up a Node, it fails here because is_a is never declared anywhere!

e.g.


GraphDocument graphDocument = openGraphDocument("phenio.json");
Graph phenio = graphDocument.getGraphs().get(0);

// create a map of Id: Node, where Id is a URI String
Map<String, Node> nodes = phenio.getNodes().stream()
                .map(node -> Node.of(node.getId(), node.getLabel()))
                .collect(Collectors.toMap(Node::getId, Function.identity()));

// special case it seems
Node isA = nodes.values().stream()
        .filter(node -> node.getLabel().equals("is_a"))
        .findFirst()
        // so this would be where to put rdfs:subClassOf - perhaps that ought to be the URI and keep `is_a` as the label?
        .orElse(Node.of("http://purl.obolibrary.org/obo/emapa#is_a", "is_a"));

phenio.getEdges().forEach(edge -> {
    // look up subject and object in the `nodes` map built above
    Node subject = nodes.get(edge.getSub());
    Node object = nodes.get(edge.getObj());
    // annoyingly we can't treat the predicate in a consistent fashion due to
    // 'is_a' being an undeclared, implicit 'primitive' type
    Node predicate;
    if (edge.getPred().equals("is_a")) {
        predicate = isA;
    } else {
        predicate = nodes.get(edge.getPred());
    }
});

matentzn commented 1 year ago

cc @cmungall

cmungall commented 1 year ago

The focus of this issue is predicates like http://purl.obolibrary.org/obo/emapa#is_a which seems like a data bug. I don't think this has anything to do with obojson.

I checked the latest emapa.owl and it is present, but only as a declaration:

✗ grep is_a db/phenio.owl | grep emapa
    <!-- http://purl.obolibrary.org/obo/emapa#is_a -->
    <owl:ObjectProperty rdf:about="http://purl.obolibrary.org/obo/emapa#is_a">

Unfortunately there is no way to tell from the OWL where this comes from, but a good guess is emapa itself:

✗ curl -L -s $OBO/emapa.owl  | grep emapa#is_a
    <!-- http://purl.obolibrary.org/obo/emapa#is_a -->
    <owl:ObjectProperty rdf:about="http://purl.obolibrary.org/obo/emapa#is_a">

I know the genesis of these things: twenty years ago someone declared is-a in oboedit even though they didn't need to, and it has stuck around ever since.

This one is harmless as it's just a declaration that is not used. Of course we should still report upstream and possibly fix, and we should do more QA/QC on ontologies we bring in.

But there are worse issues. emapa isn't using the standard part-of predicate (BFO:0000050). It is using http://purl.obolibrary.org/obo/emapa#part_of

This means that partonomy queries on EMAPA will yield massively incomplete results. And EMAPA is essentially a partonomy, there is minimal info in subclassing.
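
One possible way to work around this downstream, sketched under assumptions: rewrite known ontology-local predicates to their standard IRIs before running partonomy queries. The `PredicateNormalizer` class and its one-entry mapping table are hypothetical; the only facts used are the two IRIs named above, with BFO:0000050 written in its expanded OBO PURL form.

```java
import java.util.Map;

// Hypothetical sketch: map ontology-local predicates onto standard ones.
class PredicateNormalizer {

    // Local IRI -> standard IRI. BFO:0000050 expands to the OBO PURL below.
    static final Map<String, String> STANDARD = Map.of(
            "http://purl.obolibrary.org/obo/emapa#part_of",
            "http://purl.obolibrary.org/obo/BFO_0000050");

    // Return the standard IRI if one is known, otherwise the input unchanged.
    static String normalize(String predicate) {
        return STANDARD.getOrDefault(predicate, predicate);
    }
}
```

Fixing the predicate upstream in EMAPA would still be preferable; a table like this only hides the problem for consumers that apply it.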

I suggest a strategy:

We should have a monarch-wide simple profile that should be satisfied

Pretty much everything else can be ignored

cmungall commented 1 year ago

I don't think obojson is relevant to this issue at all, but regarding the question: yes, is_a is the hardcoded value for rdfs:subClassOf between two named classes.
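
For consumers that want to treat every edge predicate uniformly, a small sketch of handling that hardcoded value in one place. The `PredicateResolver` class, its method names, and the `labelsByIri` map are hypothetical; the expanded IRI for rdfs:subClassOf is the standard RDF Schema one.

```java
import java.util.Map;

// Hypothetical sketch: special-case the hardcoded "is_a" predicate once.
class PredicateResolver {

    static final String IS_A = "is_a";
    static final String RDFS_SUBCLASS_OF =
            "http://www.w3.org/2000/01/rdf-schema#subClassOf";

    // Expand the hardcoded "is_a" value to the rdfs:subClassOf IRI.
    static String toIri(String pred) {
        return IS_A.equals(pred) ? RDFS_SUBCLASS_OF : pred;
    }

    // Label for display: "is_a" has no declared node, so return it as-is;
    // otherwise use the declared label, falling back to the IRI itself.
    static String label(String pred, Map<String, String> labelsByIri) {
        if (IS_A.equals(pred)) {
            return IS_A;
        }
        return labelsByIri.getOrDefault(pred, pred);
    }
}
```

With this in place, the lookup never needs a declared node for is_a, and other predicates still resolve through the declared nodes.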