openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

/tree enzyme & chebi root calls fail in 2.2 #387

Closed jakhag closed 7 years ago

jakhag commented 7 years ago

Enzyme queries seem to fail. Interestingly, the child/parent hierarchy calls work, but the root call is not showing enzyme and chebi.

http://alpha.openphacts.org:3002/tree?app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1 

In 2.2 (alpha): 404 Not Found http://alpha.openphacts.org:3002/tree?app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&root=enzyme

In 2.1 (beta): https://beta.openphacts.org/2.1/tree?app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&root=enzyme

randykerber commented 7 years ago

The first query (/tree) fails in all combos of v2.1 and v2.2 IMS and Virtuoso.

The second query (/tree?root=enzyme) works with v2.1 Virtuoso but fails with v2.2 Virtuoso.

randykerber commented 7 years ago

The SPARQL query behind the second LDA query returns no results. The error message is:

This document is empty and basically useless. It is generated by a web service that can make some statements in HTML Microdata format. This time the service made zero such statements, sorry.

Here's the SPARQL:

PREFIX ops: <http://www.openphacts.org/api#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uniprot: <http://purl.uniprot.org/core/>
CONSTRUCT {
  ops:conceptHierarchy dcterms:hasPart ?g_short .
  ?g_short ops:rootNode ?root_node .
  ?root_node skos:prefLabel ?name .
  <http://purl.uniprot.org/enzyme> skos:prefLabel 'Enzyme Classification' .
  <http://www.ebi.ac.uk/chembl/target> skos:prefLabel 'ChEMBL Target Hierarchy' .
  <http://www.ebi.ac.uk/chebi> skos:prefLabel 'ChEBI Ontology' .
  <http://www.geneontology.org> skos:prefLabel 'GeneOntology' .
  <http://www.bioassayontology.org> skos:prefLabel 'BioAssayOntology' .
  <http://purl.obolibrary.org/obo/doid> skos:prefLabel 'Human Disease Ontology' .
}
WHERE {
  VALUES ?g_short { <http://purl.uniprot.org/enzyme> }
  {
    SELECT DISTINCT ?root_node ?g_short WHERE {
      VALUES ?g_short { <http://purl.uniprot.org/enzyme> }
      VALUES ?g {
        <http://purl.uniprot.org/enzyme/direct>
        <http://www.ebi.ac.uk/chembl/target/direct>
        <http://www.ebi.ac.uk/chebi/direct>
        <http://www.geneontology.org>
        <http://www.bioassayontology.org>
        <http://purl.obolibrary.org/obo/doid>
      }
      GRAPH ?g {
        [] rdfs:subClassOf ?root_node .
        MINUS {?root_node rdfs:subClassOf []}
        FILTER ( isURI(?root_node) )
        BIND (IF(?g = <http://purl.uniprot.org/enzyme/direct>, IRI(<http://purl.uniprot.org/enzyme>),
              IF(?g = <http://www.ebi.ac.uk/chembl/target/direct>, IRI(<http://www.ebi.ac.uk/chembl/target>),
              IF(?g = <http://www.ebi.ac.uk/chebi/direct>, IRI(<http://www.ebi.ac.uk/chebi>),
              IF(?g = <http://www.geneontology.org>, IRI(<http://www.geneontology.org>),
              IF(?g = <http://www.bioassayontology.org>, IRI(<http://www.bioassayontology.org>),
              IF(?g = <http://purl.obolibrary.org/obo/doid>, IRI(<http://purl.obolibrary.org/obo/doid>), 'Error')))))) AS ?g_short )
      }
    }
  }
  {
    ?root_node rdfs:label ?name
  }
  UNION {
    ?root_node skos:prefLabel ?name
  }
  MINUS { ?root_node uniprot:obsolete true }
}

ianwdunlop commented 7 years ago

On alpha, /tree seems to have all the hierarchies except enzyme (UniProt).

I guess that is why the root=enzyme query fails.

ianwdunlop commented 7 years ago

The query seems to expect these triple patterns, where ?root_node is <http://purl.uniprot.org/core/Enzyme>:

<http://purl.uniprot.org/core/Enzyme> rdfs:label ?name .
<http://purl.uniprot.org/core/Enzyme> skos:prefLabel ?name .

but they don't exist. Maybe they need to be manually created and added to the triple store.

Chris-Evelo commented 7 years ago

I understood that it was decided to load only the part of UniProt that is really being used (even though we always said that having all of it is an advantage, since you can easily add new calls for, say, PPIs when they are included). Maybe this part was simply not loaded?

ianwdunlop commented 7 years ago

Not sure, @Chris-Evelo. If anyone has access to the beta SPARQL endpoint, they could try running the following to see what the labels should be:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
select distinct * where {
  <http://purl.uniprot.org/core/Enzyme> rdfs:label ?label .
  <http://purl.uniprot.org/core/Enzyme> skos:prefLabel ?pref_label .
}

randykerber commented 7 years ago

I'm not seeing any triples on beta (v2.1) or alpha (v2.2) matching the pattern <http://purl.uniprot.org/core/Enzyme> ?p ?o .

beta = http://beta.openphacts.org:3003/sparql
alpha = http://alpha.openphacts.org:8890/sparql

danidi commented 7 years ago

The strange thing is, if you run the child query (http://alpha.openphacts.org:3002/tree/children?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F1.-.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_format=json), the labels are there. Also with the parents query, all top level class labels are shown (e.g. http://alpha.openphacts.org:3002/tree/parents?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_format=json).

Was there any change in the SPARQL query from 2.1 to 2.2?
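
For anyone reproducing these calls from a script, here is a minimal Python sketch of how the percent-encoded uri parameter in the children and parents URLs can be built. The helper name tree_url and the placeholder credentials are made up for illustration. The base URL and paths are the alpha ones quoted above, and the sketch only builds the URLs, it does not send any request.

```python
from urllib.parse import urlencode

# Base URL of the alpha API as quoted in this thread.
BASE = "http://alpha.openphacts.org:3002"


def tree_url(path, app_id, app_key, **params):
    """Hypothetical helper: build a /tree request URL.

    urlencode percent-encodes each value, so a full enzyme URI can be
    passed as-is in the `uri` parameter.
    """
    query = dict(params)
    query["app_id"] = app_id
    query["app_key"] = app_key
    return f"{BASE}{path}?{urlencode(query)}"


# Placeholder credentials; substitute a real app_id and app_key.
children = tree_url(
    "/tree/children", "YOUR_APP_ID", "YOUR_APP_KEY",
    uri="http://purl.uniprot.org/enzyme/1.-.-.-", _format="json",
)
root = tree_url("/tree", "YOUR_APP_ID", "YOUR_APP_KEY", root="enzyme")

print(children)
print(root)
```

The parameter order differs from the hand-written URLs above, which should not matter for a query string.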

randykerber commented 7 years ago

@danidi -- As far as I can tell, there has been no change to the "/tree" SPARQL since before 2015.

ianwdunlop commented 7 years ago

So there is something wrong with the enzyme hierarchy on alpha. Have a look at this query:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT distinct * WHERE {
  <http://purl.uniprot.org/enzyme/1.-.-.-> ?p ?o .
} 
LIMIT 100

on alpha and beta

If you look closely you will notice that alpha has

<http://purl.uniprot.org/enzyme/1.-.-.-> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.uniprot.org/enzyme/1.-.-.->

and

<http://purl.uniprot.org/enzyme/1.-.-.-> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.uniprot.org/core/Enzyme>

but beta does not. I'm sure that it shouldn't be a subclass of itself. Whether it should be a subclass of Enzyme I'm not sure, but beta implies not. As you know, you can go RDF blind looking at this stuff, so please check for yourself.

ianwdunlop commented 7 years ago

If you look at the original query a few comments above, it doesn't allow the root nodes to be a subClassOf anything: MINUS {?root_node rdfs:subClassOf []}. So it looks like the enzyme root nodes have some rogue statements.

randykerber commented 7 years ago

Technically, in the land of RDF and OWL (if I'm remembering right), all classes are rdfs:subClassOf themselves. Though the name is "subClassOf", it really means "subclass of or equivalent to". But I imagine no one wanted such an awkward name.

randykerber commented 7 years ago

URIs like <http://purl.uniprot.org/enzyme/1.14.-.-> I think are supposed to represent "classes" of Enzymes rather than instances of Enzymes, right?

So maybe saying it's rdf:type=Enzyme is incorrect to a semantic purist. Though I don't think any semantic purists would survive very long breathing the air of the OpenPhacts triple-store.

For a practical RDF hacker working with OpenPhacts data, it likely just comes down to this: whatever RDF statements produce the right answers are the "correct" semantics.

ianwdunlop commented 7 years ago

I just removed the ?s rdfs:subClassOf <http://purl.uniprot.org/core/Enzyme> statements to see if it makes a difference.

randykerber commented 7 years ago

Does this look like the right answer:

@prefix skos:   <http://www.w3.org/2004/02/skos/core#> .
@prefix ns1:    <http://www.ebi.ac.uk/> .
@prefix ns2:    <http://purl.obolibrary.org/obo/> .
@prefix ns3:    <http://purl.uniprot.org/> .
@prefix ns4:    <http://www.openphacts.org/api#> .
@prefix ns5:    <http://www.ebi.ac.uk/chembl/> .
@prefix ns6:    <http://purl.org/dc/terms/> .

ns1:chebi
    skos:prefLabel  "ChEBI Ontology" .
ns2:doid
    skos:prefLabel  "Human Disease Ontology" .
<http://www.geneontology.org>
    skos:prefLabel  "GeneOntology" .
ns3:enzyme
    skos:prefLabel  "Enzyme Classification" ;
    ns4:rootNode    <http://purl.uniprot.org/enzyme/1.-.-.-> , <http://purl.uniprot.org/enzyme/2.-.-.-> , <http://purl.uniprot.org/enzyme/3.-.-.-> , <http://purl.uniprot.org/enzyme/4.-.-.-> , <http://purl.uniprot.org/enzyme/5.-.-.-> , <http://purl.uniprot.org/enzyme/6.-.-.-> .
<http://purl.uniprot.org/enzyme/1.-.-.->
    skos:prefLabel  "Oxidoreductases" .
<http://purl.uniprot.org/enzyme/2.-.-.->
    skos:prefLabel  "Transferases" .
<http://purl.uniprot.org/enzyme/3.-.-.->
    skos:prefLabel  "Hydrolases" .
<http://purl.uniprot.org/enzyme/4.-.-.->
    skos:prefLabel  "Lyases" .
<http://purl.uniprot.org/enzyme/5.-.-.->
    skos:prefLabel  "Isomerases" .
<http://purl.uniprot.org/enzyme/6.-.-.->
    skos:prefLabel  "Ligases" .
<http://www.bioassayontology.org>
    skos:prefLabel  "BioAssayOntology" .
ns5:target
    skos:prefLabel  "ChEMBL Target Hierarchy" .
ns4:conceptHierarchy
    ns6:hasPart ns3:enzyme .

randykerber commented 7 years ago

In the SPARQL query, I replaced this:

MINUS {?root_node rdfs:subClassOf []}

With this:

MINUS {
  ?root_node rdfs:subClassOf ?super .
  FILTER( ?super != ?root_node && ?super != <http://purl.uniprot.org/core/Enzyme> )
}

ianwdunlop commented 7 years ago

Yeah, that's probably one way to fix it but it never needed that before so why now? Anyway, I ran this:

SPARQL DELETE WHERE { GRAPH <http://purl.uniprot.org/enzyme/direct> {?s <http://www.w3.org/2000/01/rdf-schema#subClassOf> <http://purl.uniprot.org/core/Enzyme> . }};

Which fixed it. We can always add the subClassOf back in and use your query if we really need it. I guess it will not hurt to add your SPARQL into the API query anyway, since the triple store will no doubt be restored from source files at some point. The question is where that subClassOf came from. Has it always been there, and did someone manually remove it during the load process?

PS. Nice sparql @randykerber :)
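
To compare the two fixes side by side, here is a rough Python sketch on a toy graph. It is an assumption-laden model rather than the real data: the direct graph is reduced to two made-up rdfs:subClassOf statements, the set of labelled nodes is assumed, and the function roots is a hypothetical stand-in for the root-node part of the /tree query (including its label join).

```python
ENZYME = "http://purl.uniprot.org/core/Enzyme"
E1 = "http://purl.uniprot.org/enzyme/1.-.-.-"
E1_1 = "http://purl.uniprot.org/enzyme/1.1.-.-"

# Toy stand-in for the rdfs:subClassOf statements in the direct graph,
# stored as (subclass, superclass) pairs. The contents are assumed.
direct = {(E1_1, E1), (E1, ENZYME)}

# Nodes assumed to carry an rdfs:label or skos:prefLabel; core/Enzyme has none.
labelled = {E1, E1_1}


def roots(edges, relaxed=False):
    """Return labelled nodes that have a subclass but no superclass."""
    candidates = {sup for _, sup in edges}
    if relaxed:
        # Relaxed MINUS: overlook reflexive statements and links to core/Enzyme.
        blocked = {sub for sub, sup in edges if sup != sub and sup != ENZYME}
    else:
        # Original MINUS: any superclass statement disqualifies the node.
        blocked = {sub for sub, _ in edges}
    return (candidates - blocked) & labelled


# Equivalent of the DELETE: drop every "?s rdfs:subClassOf core/Enzyme" pair.
cleaned = {(sub, sup) for sub, sup in direct if sup != ENZYME}

print(roots(direct))                # original query on the unmodified graph
print(roots(direct, relaxed=True))  # relaxed MINUS on the unmodified graph
print(roots(cleaned))               # original query after the DELETE
```

In this toy model the original query finds only core/Enzyme as a root, which then drops out because it has no label, while both the relaxed MINUS and the DELETE leave 1.-.-.- as the root.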

randykerber commented 7 years ago

Contents of graph <http://purl.uniprot.org/enzyme/inference> are added by this SPARQL query:

INSERT {
    GRAPH <http://purl.uniprot.org/enzyme/inference> {
        ?subclass <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?superclass .
        ?subclass <http://www.w3.org/2000/01/rdf-schema#subClassOf> ?subclass .
    }
}
WHERE {
    GRAPH <http://purl.uniprot.org/enzyme/direct> {
    ?subclass <http://www.w3.org/2000/01/rdf-schema#subClassOf>+ ?superclass ;
        [] []
    }
}

It comes from a file called "insert_queries.sparql" inside the tar file "enzyme.tar" in the file repository for openphacts v2.1 (here): https://data.openphacts.org/free/2.1/rdf/
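
As a rough illustration of what that INSERT appears to compute, here is a small Python sketch: for every node with at least one superclass it adds all transitive superclasses plus a reflexive statement. The function name infer and the two-statement toy graph are assumptions for illustration only, and the extra "[] []" pattern in the WHERE clause is not modelled.

```python
def infer(direct):
    """Mimic the INSERT on (subclass, superclass) pairs: transitive closure
    of subClassOf plus a reflexive pair for each node with a superclass."""
    parents = {}
    for sub, sup in direct:
        parents.setdefault(sub, set()).add(sup)

    inferred = set()
    for start in parents:
        # Walk upwards from `start`, collecting every reachable superclass.
        seen, stack = set(), list(parents[start])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        inferred |= {(start, sup) for sup in seen}
        inferred.add((start, start))  # second triple of the INSERT template
    return inferred


# Toy direct graph with short stand-in names instead of full URIs.
toy_direct = {("1.1.-.-", "1.-.-.-"), ("1.-.-.-", "Enzyme")}

for pair in sorted(infer(toy_direct)):
    print(pair)
```

Under this reading, a node only gets a reflexive statement in the inference graph if it has some superclass in the direct graph, so a top-level class linked to core/Enzyme would pick one up while core/Enzyme itself would not.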

randykerber commented 7 years ago

This issue appears to be fixed. Test queries do now return answers.

However, we should look at the actual answers returned and see if they really make sense.

For example, for root=chembl the root is returned as: <http://rdf.ebi.ac.uk/resource/chembl/protclass/CHEMBL_PC_0> skos:prefLabel "Protein class" .

Root of chembl hierarchy is "Protein class"? That doesn't sound right.

randykerber commented 7 years ago

This query appears to show all 6 of the roots: http://alpha.openphacts.org:3002/tree

danidi commented 7 years ago

I think "Protein class" is ok, as all the children are actually proteins (it's the ChEMBL Protein Target Tree). At least it's the same behaviour as previously.

randykerber commented 7 years ago

@danidi -- OK, if it's doing what it was designed to do, and did before, I'll call that "working" and close this. It is a misleading label, though. We might someday consider renaming it to something like "/tree?root=chembl_protein", or using two params, e.g. "/tree?dataset=chembl&category=protein".