Closed: danidi closed this issue 7 years ago
Has this exact query worked at some point in the past? If so, when?
It worked at some point (it was chosen as the default query because it has fewer results than the previous default for class 1.-.-.-, and therefore returned data faster), but apparently not since 2.0.
I still have a workflow with data for https://beta.openphacts.org/1.5/target/tree/pharmacology/pages?app_id=15a18100&app_key=528a8272f1cd961d215f318a0315dd3d&target_organism=Homo+sapiens%7CMus+musculus&minEx-pChembl=7&uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F1.-.-.-&_pageSize=250&_page=55, so in 1.5 a related query was definitely working.
The count calls work fine: https://beta.openphacts.org/2.1/target/tree/pharmacology/count?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1
And the API call itself works fine with a ChEMBL classification query that should retrieve over 600 items (which is in the same order of magnitude as the enzyme call): https://beta.openphacts.org/2.1/target/members/count?app_id=15a18100&app_key=528a8272f1cd961d215f318a0315dd3d&uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fchembl%2Fprotclass%2FCHEMBL_PC_6.
I think there was previously some caching of the results for the target class pharmacology calls, as these were very time-consuming. Not sure if that is related to the issue here.
This is a bit of a mystery. If I extract the SPARQL query that gets created for that enzyme/6.2.-.- and pass it to the SPARQL endpoint, Virtuoso won't even run it; it just returns an error because there's a variable '?node' in the SELECT statement that is not in the rest of the query. If I change the name of that SPARQL variable to '?node_uri', which is in the query body (don't know if that's really correct, just a guess), the query runs, but it eventually times out.
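A rough way to spot this kind of mismatch before sending a generated query to the endpoint is to compare the variables in the SELECT clause with those in the body. The sketch below is only a regex-based heuristic, not a SPARQL parser: the helper name `unbound_select_vars` is made up for illustration, it only looks at the first SELECT, and it would be confused by IRIs that contain `?` or `$`.

```python
import re

# Matches SPARQL variables such as ?item or $item.
_VAR = re.compile(r"[?$]([A-Za-z_][A-Za-z0-9_]*)")
# Matches aliases introduced in the projection, e.g. (COUNT(?x) AS ?n).
_ALIAS = re.compile(r"\bAS\s+[?$]([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def unbound_select_vars(query):
    """Return variables projected by the first SELECT that never occur after its WHERE.

    Hypothetical helper: a heuristic text check, not a real SPARQL parse.
    """
    head = re.search(r"\bSELECT\b(.*?)\bWHERE\b", query, re.IGNORECASE | re.DOTALL)
    if head is None:
        return set()
    projection = head.group(1)
    # Aliases are defined by the projection itself, so they are not expected in the body.
    projected = set(_VAR.findall(projection)) - set(_ALIAS.findall(projection))
    body = set(_VAR.findall(query[head.end():]))
    return projected - body


# Simplified stand-in for the generated query: ?node is selected, only ?node_uri is bound.
example = (
    "SELECT DISTINCT ?item ?node WHERE { "
    "VALUES ?node_uri { <http://purl.uniprot.org/enzyme/6.2.-.-> } "
    "?class rdfs:subClassOf ?node_uri . ?item ?p ?class }"
)
missing = unbound_select_vars(example)
print(sorted(missing))
```

Run against the simplified example, the only variable reported is `node`, which matches the symptom described above.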
If I see this correctly, the ?node was introduced in this change https://github.com/openphacts/OPS_LinkedDataApi/commit/3f8f05d71c763dc0457bb6eeef17f82b54f42a0a#diff-bb99944febec6798782a86256658385a in the 03_10_targetTreePharma.ttl file. The changes were done for the results of the ChEMBL classification, but maybe they affected the enzyme query somehow?
The IMS call and response for this LDA command are as follows:
Basically, the only mapping that IMS finds for <http://purl.uniprot.org/enzyme/6.2.-.-> is to itself. Is this what it should return?
Request (split into lines and url-encoding removed for readability) :
http://alpha.openphacts.org:3004/QueryExpander/mapUriRDF?rdfFormat=RDF/XML
& targetUriPattern=http://rdf.ebi.ac.uk/resource/chembl/protclass/
& targetUriPattern=http://purl.obolibrary.org/obo/CHEBI_
& targetUriPattern=http://purl.uniprot.org/enzyme/
& targetUriPattern=http://purl.obolibrary.org/obo/GO_
& targetUriPattern=http://www.bioassayontology.org/bao#BAO_
& targetUriPattern=http://purl.obolibrary.org/obo/DOID_
& overridePredicateURI=http://www.w3.org/2004/02/skos/core#exactMatch
& lensUri=Default
& Uri=http://purl.uniprot.org/enzyme/6.2.-.-
Response:
<rdf:RDF
xmlns:ops="http://no/BaseURI/Set/"
xmlns:void="http://rdfs.org/ns/void#"
xmlns:dul="http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://purl.uniprot.org/enzyme/6.2.-.-">
<exactMatch xmlns="http://www.w3.org/2004/02/skos/core#" rdf:resource="http://purl.uniprot.org/enzyme/6.2.-.-"/>
</rdf:Description>
</rdf:RDF>
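For anyone who wants to repeat this check for other URIs, here is a minimal Python sketch (standard library only) that rebuilds the request URL from the parameters above and pulls the exactMatch mappings out of a response like the one shown. The function names `build_map_uri_request` and `exact_matches` are illustrative, not part of IMS or the LDA, and nothing here actually contacts the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def build_map_uri_request(base, uri, target_patterns):
    """Build the mapUriRDF request URL; repeated targetUriPattern keys are kept in order."""
    params = [("rdfFormat", "RDF/XML")]
    params += [("targetUriPattern", pattern) for pattern in target_patterns]
    params += [
        ("overridePredicateURI", SKOS_EXACT_MATCH),
        ("lensUri", "Default"),
        ("Uri", uri),
    ]
    return base + "?" + urlencode(params)


def exact_matches(rdf_xml):
    """Map each rdf:Description subject to the list of its skos:exactMatch targets."""
    root = ET.fromstring(rdf_xml)
    mappings = {}
    for desc in root.iter("{%s}Description" % RDF_NS):
        subject = desc.get("{%s}about" % RDF_NS)
        for match in desc.findall("{http://www.w3.org/2004/02/skos/core#}exactMatch"):
            mappings.setdefault(subject, []).append(match.get("{%s}resource" % RDF_NS))
    return mappings


ENZYME = "http://purl.uniprot.org/enzyme/6.2.-.-"

url = build_map_uri_request(
    "http://alpha.openphacts.org:3004/QueryExpander/mapUriRDF",
    ENZYME,
    ["http://rdf.ebi.ac.uk/resource/chembl/protclass/", "http://purl.uniprot.org/enzyme/"],
)

# The response body quoted in this thread, pasted in as a string.
response = """<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
 <rdf:Description rdf:about="http://purl.uniprot.org/enzyme/6.2.-.-">
  <exactMatch xmlns="http://www.w3.org/2004/02/skos/core#"
              rdf:resource="http://purl.uniprot.org/enzyme/6.2.-.-"/>
 </rdf:Description>
</rdf:RDF>"""

mappings = exact_matches(response)
only_self = mappings == {ENZYME: [ENZYME]}
print(url)
print(only_self)
```

`only_self` is just a convenience flag for the question asked above: whether the URI maps to nothing but itself.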
Here's the SPARQL query. It's failing because it's timing out.
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo_goa: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX cheminf: <http://semanticscience.org/resource/>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo_goa: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX cheminf: <http://semanticscience.org/resource/>
SELECT DISTINCT ?item ?chembl_compound WHERE {
{ SELECT DISTINCT ?chembl_target ?chembl_compound ?ocrs_compound ?node_uri ?assay_uri ?item ?target_type ?issue ?dup ?conf_score WHERE {
?chembl_target chembl:hasProteinClassification|chembl:hasTargetComponent/chembl:targetCmptXref|chembl:hasTargetComponent/skos:exactMatch/obo_goa:C|chembl:hasTargetComponent/skos:exactMatch/obo_goa:F|chembl:hasTargetComponent/skos:exactMatch/obo_goa:P ?class .
VALUES ?node_uri { <http://purl.uniprot.org/enzyme/6.2.-.-> } GRAPH ?g {
?class rdfs:subClassOf ?node_uri .
}
GRAPH <http://www.ebi.ac.uk/chembl> {
?item a chembl:Activity ;
chembl:hasAssay ?assay_uri ;
chembl:hasMolecule ?chembl_compound .
?assay_uri chembl:hasTarget ?chembl_target .
?chembl_target a ?target_type .
OPTIONAL {
GRAPH <http://ops.rsc.org> {
?ocrs_compound skos:exactMatch ?chembl_compound .
}
}
OPTIONAL { ?item chembl:dataValidityIssue ?issue_tmp }
BIND (IF (BOUND(?issue_tmp) , true, false) AS ?issue)
OPTIONAL { ?item chembl:potentialDuplicate ?dup_tmp }
BIND (IF (BOUND(?dup_tmp) , true, false) AS ?dup)
OPTIONAL { ?assay_uri chembl:targetConfScore ?conf_score_tmp }
BIND (IF (BOUND(?conf_score_tmp) , ?conf_score_tmp, 0) AS ?conf_score)
}
} }
GRAPH <http://www.ebi.ac.uk/chembl> {
{ ?assay_uri chembl:organismName ?assay_organism }
UNION { ?assay_uri dcterms:description ?assay_description }
UNION { ?assay_uri chembl:assayTestType ?assay_type }
UNION { ?assay_uri chembl:targetConfDesc ?conf_desc }
UNION { ?assay_uri chembl:targetRelType ?rel_type ;
chembl:targetRelDesc ?rel_desc }
UNION { ?chembl_target dcterms:title ?target_name }
UNION { ?chembl_target chembl:organismName ?target_organism }
UNION { ?chembl_target chembl:hasTargetComponent ?protein .
OPTIONAL {
GRAPH <http://www.conceptwiki.org> {
?cw_target skos:exactMatch ?protein ;
skos:prefLabel ?protein_name
}
}
}
UNION { ?item chembl:publishedType ?published_type }
UNION { ?item chembl:publishedRelation ?published_relation }
UNION { ?item chembl:publishedValue ?published_value }
UNION { ?item chembl:publishedUnits ?published_unit }
UNION { ?item chembl:standardType ?activity_type }
UNION { ?item chembl:standardRelation ?activity_relation }
UNION { ?item chembl:standardValue ?std_value .
BIND (xsd:decimal(?std_value) as ?activity_value)
}
UNION { ?item chembl:standardUnits ?activity_unit }
UNION { ?item chembl:hasQUDT ?qudt_uri }
UNION { ?item chembl:pChembl ?pChembl }
UNION { ?item chembl:activityComment ?act_comment }
UNION { ?item chembl:hasDocument ?document .
{ ?document owl:sameAs ?doi }
UNION { ?document bibo:pmid ?pmid }
}
UNION { ?item chembl:dataValidityComment ?comment} UNION {
GRAPH <http://ops.rsc.org> {
{ ?ocrs_compound cheminf:CHEMINF_000396 ?inchi }
UNION {?ocrs_compound cheminf:CHEMINF_000399 ?inchi_key}
UNION {?ocrs_compound cheminf:CHEMINF_000018 ?smiles }
}
}
}
} ORDER BY ?item LIMIT 10 OFFSET 0
There may be some data missing or in the wrong graph. If you break the query down a bit then you can see that there is something strange happening:
PREFIX ops: <http://www.openphacts.org/api#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
SELECT DISTINCT ?chembl_target WHERE {
{
VALUES ?node_uri {
<http://purl.uniprot.org/enzyme/6.2.-.->
} GRAPH ?g {
?class rdfs:subClassOf ?node_uri
}
GRAPH <http://www.ebi.ac.uk/chembl> {
{
?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class
}
}
}
} LIMIT 10
This query returns no results, so there appears to be no link between ChEMBL targets and subclasses of enzyme <http://purl.uniprot.org/enzyme/6.2.-.-> inside that graph. However, if you remove the GRAPH clause then you do get some results:
PREFIX ops: <http://www.openphacts.org/api#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
SELECT DISTINCT ?chembl_target WHERE {
{
VALUES ?node_uri {
<http://purl.uniprot.org/enzyme/6.2.-.->
} GRAPH ?g {
?class rdfs:subClassOf ?node_uri
}
{
?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class
}
}
} LIMIT 10
You can do something similar by counting the number of URIs that have the enzyme bit:
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT (COUNT(?class) as ?class_count) WHERE {
# GRAPH <http://www.ebi.ac.uk/chembl> {
?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class
FILTER ( strstarts(str(?class), "http://purl.uniprot.org/enzyme/") )
# }
}
Here it is 5000+. If you uncomment the GRAPH bit then you get 0.
The usual RDF & SPARQL blindness rules apply here. Please try for yourself :)
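One way to narrow down where those triples actually live is to count them per named graph instead of switching a single GRAPH clause on and off. The following is a small sketch that only builds such a diagnostic query as a string; `per_graph_count_query` is a hypothetical helper, the inputs are substituted without any escaping (so they are assumed to be trusted), and the query has not been run against the Open PHACTS endpoint.

```python
ENZYME_PREFIX = "http://purl.uniprot.org/enzyme/"
TARGET_CMPT_XREF = "http://rdf.ebi.ac.uk/terms/chembl#targetCmptXref"


def per_graph_count_query(predicate_iri, object_prefix):
    """Build a SPARQL 1.1 query that counts matching triples per named graph."""
    return (
        "SELECT ?g (COUNT(*) AS ?n) WHERE {\n"
        "  GRAPH ?g {\n"
        "    ?s <%s> ?o .\n"
        '    FILTER ( STRSTARTS(STR(?o), "%s") )\n'
        "  }\n"
        "}\n"
        "GROUP BY ?g\n"
        "ORDER BY DESC(?n)"
    ) % (predicate_iri, object_prefix)


diagnostic = per_graph_count_query(TARGET_CMPT_XREF, ENZYME_PREFIX)
print(diagnostic)
```

If the enzyme cross-references turn up under a graph other than <http://www.ebi.ac.uk/chembl>, that would be consistent with the 5000+ versus 0 counts above.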
As far as I can figure out, it is all the 'filterable' values that are causing the problem, i.e. all the items in the second GRAPH clause. If you leave this section out, the query finishes in an almost acceptable time. With them in, it times out.
Now this next statement is not scientifically proven, but it seems that every filter you add makes the query take an extra ~50 seconds to run. The first one
{
?assay_uri chembl:organismName ?assay_organism
}
doesn't seem to add much query time, but the UNION statements after it do:
UNION {
?assay_uri dcterms:description ?assay_description
}
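To put numbers on that impression, the branches could be added one at a time and each run timed. This is a minimal sketch of such a harness: `time_cumulative_unions` and the `run_query` callable are hypothetical (any function that sends a query string to the endpoint would do), and the query shape is a simplification of the generated one.

```python
import time


def time_cumulative_unions(core, branches, run_query, clock=time.perf_counter):
    """Time the query with the first 1, 2, ... n UNION branches appended to `core`.

    `run_query` is a caller-supplied function (an assumption of this sketch)
    that executes one SPARQL query string. Returns (branch_count, seconds) pairs.
    """
    timings = []
    for count in range(1, len(branches) + 1):
        union = "\n  UNION ".join("{ %s }" % branch for branch in branches[:count])
        query = "SELECT DISTINCT ?item WHERE {\n  %s\n  %s\n} LIMIT 10" % (core, union)
        start = clock()
        run_query(query)
        timings.append((count, clock() - start))
    return timings


# Dry run with a stand-in for the endpoint: it records the queries instead of sending them.
sent = []
core_pattern = "?item a chembl:Activity ; chembl:hasAssay ?assay_uri ."
union_branches = [
    "?assay_uri chembl:organismName ?assay_organism",
    "?assay_uri dcterms:description ?assay_description",
]
results = time_cumulative_unions(core_pattern, union_branches, sent.append)
print([count for count, _ in results])
```

Swapping `sent.append` for a real HTTP call would give one timing per added branch, which is enough to check whether the cost really grows by a roughly constant amount each time.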
The SPARQL query makes no sense to me. Hard to say it's wrong without knowing what it's trying to accomplish.
It would seem meant to produce a set of ?item ?chembl_compound pairs.
But the entire second half of the query (starting at GRAPH <http://www.ebi.ac.uk/chembl> {) is one big UNION statement. It appears to serve no purpose other than to make the query time out. None of the object values of those union triples appear in the first half of the query, so they would seem to have no effect on the actual answer produced.
Maybe this query was produced by trying to copy-n-paste from somewhere else and something was botched? I dunno. But to be able to fix it, would need to know what it is supposed to be doing. Should it just be returning ?item and ?chembl_compound? Is it supposed to be returning a bunch of information?
The response from the target class pharmacology query should be similar to the target pharmacology pages API call, but retrieve the information for all targets that are in a specific target class. So for several targets it would retrieve bioactivity information with several compounds, together with information e.g. assay description.
After some more digging around we discovered that this call actually sends 2 queries. The first one from above finds out what chembl activities/compounds are involved by applying all the user supplied filters. Then another one fetches the data for each of them. The filters take forever to run. Maybe we can remove them. The user will still get back all the info about the hierarchy just not the filtered version.
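As a rough illustration of that two-step pattern (not the actual query the LDA generates), the second step can be thought of as a query restricted to the items found by the first one via a VALUES block. The helper `build_details_query` and the example.org URIs below are made up for this sketch.

```python
def build_details_query(item_uris, graph="http://www.ebi.ac.uk/chembl"):
    """Build a follow-up query that fetches all properties of the selected items.

    Hypothetical simplification: the real second query is more specific
    about which properties it retrieves.
    """
    if not item_uris:
        raise ValueError("need at least one item URI")
    values = " ".join("<%s>" % uri for uri in item_uris)
    return (
        "SELECT ?item ?p ?o WHERE {\n"
        "  VALUES ?item { %s }\n"
        "  GRAPH <%s> { ?item ?p ?o }\n"
        "}" % (values, graph)
    )


# Placeholder item URIs standing in for the results of the first (selection) query.
page_of_items = ["http://example.org/activity/1", "http://example.org/activity/2"]
details = build_details_query(page_of_items)
print(details)
```

With this split, dropping the filters only changes the first query; the second one stays bounded by the page size.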
The default query now appears to work on alpha:
Can someone scan the results and see if it looks right?
Plus check any other test calls of /target/tree/pharmacology/pages.
The default query for the target class pharmacology query https://beta.openphacts.org/2.1/target/tree/pharmacology/pages?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1 returns a 500 server error.
Might be related to this issue https://github.com/openphacts/GLOBAL/issues/184, although the limits there are higher than the default 10.