openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Default target class pharmacology query returns 500 error #382

Closed danidi closed 7 years ago

danidi commented 7 years ago

The default query for the target class pharmacology query https://beta.openphacts.org/2.1/target/tree/pharmacology/pages?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1 returns a 500 server error.

Might be related to this issue https://github.com/openphacts/GLOBAL/issues/184, although the limits there are higher than the default 10.

randykerber commented 7 years ago

Has this exact query worked at some point in the past? If so, when?

danidi commented 7 years ago

It worked at some point (it was chosen as the default query because it has fewer results than the previous default for class 1.-.-.-, and was therefore faster in returning data), but apparently not since 2.0.

I still have a workflow with data for https://beta.openphacts.org/1.5/target/tree/pharmacology/pages?app_id=15a18100&app_key=528a8272f1cd961d215f318a0315dd3d&target_organism=Homo+sapiens%7CMus+musculus&minEx-pChembl=7&uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F1.-.-.-&_pageSize=250&_page=55, so in 1.5 a related query was definitely working.

The count calls work fine: https://beta.openphacts.org/2.1/target/tree/pharmacology/count?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1
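
For anyone reproducing the comparison, here is a minimal sketch that builds the failing `pages` call and the working `count` call from the same parameters, so the endpoint is the only difference. The helper name `class_pharmacology_url` and the placeholder credentials are assumptions for illustration; substitute your own app_id and app_key.

```python
from urllib.parse import urlencode

# Base path taken from the URLs quoted in this issue.
BASE = "https://beta.openphacts.org/2.1/target/tree/pharmacology"


def class_pharmacology_url(endpoint, class_uri, app_id, app_key):
    """Build a URL for the 'pages' or 'count' endpoint with identical parameters."""
    query = urlencode({"uri": class_uri, "app_id": app_id, "app_key": app_key})
    return f"{BASE}/{endpoint}?{query}"


enzyme_class = "http://purl.uniprot.org/enzyme/6.2.-.-"
# Placeholder credentials (hypothetical), not real keys.
pages_url = class_pharmacology_url("pages", enzyme_class, "YOUR_APP_ID", "YOUR_APP_KEY")
count_url = class_pharmacology_url("count", enzyme_class, "YOUR_APP_ID", "YOUR_APP_KEY")
print(pages_url)
print(count_url)
```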

And the API call itself works fine with a ChEMBL classification query that should retrieve over 600 items (which is in the same order of magnitude as the enzyme call): https://beta.openphacts.org/2.1/target/members/count?app_id=15a18100&app_key=528a8272f1cd961d215f318a0315dd3d&uri=http%3A%2F%2Frdf.ebi.ac.uk%2Fresource%2Fchembl%2Fprotclass%2FCHEMBL_PC_6.

I think previously there was some caching of the results for the target class pharmacology calls, as these were very time consuming. Not sure if that is related to the issue here.

randykerber commented 7 years ago

This is a bit of a mystery. If I extract the SPARQL query that gets created for that enzyme/6.2.-.- and pass it to the SPARQL endpoint, Virtuoso won't even run it; it just returns an error because there's a variable '?node' in the SELECT statement that is not in the rest of the query. If I change the name of that SPARQL variable to '?node_uri', which is in the query body (don't know if that's really correct, just a guess), the query runs, but it eventually times out.
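
A rough way to spot this kind of mismatch before sending a query to the endpoint is to compare the projected variables with those used in the body. The sketch below is a regex heuristic only, not a SPARQL parser: it would wrongly flag `(expr AS ?x)` projections and handles only the outermost SELECT. The function name `unbound_select_vars` and the shortened sample query are assumptions for illustration, not the query LDA actually generates.

```python
import re


def unbound_select_vars(query):
    """Return variables projected by the first SELECT that never appear after WHERE {."""
    match = re.search(
        r"SELECT\s+(?:DISTINCT\s+)?(.*?)\s+WHERE\s*\{",
        query,
        re.IGNORECASE | re.DOTALL,
    )
    if match is None:
        return set()
    projected = set(re.findall(r"\?\w+", match.group(1)))
    used_in_body = set(re.findall(r"\?\w+", query[match.end():]))
    return projected - used_in_body


# Simplified, hypothetical query that reproduces the '?node' vs '?node_uri' mismatch.
query = """
SELECT DISTINCT ?item ?node WHERE {
  VALUES ?node_uri { <http://purl.uniprot.org/enzyme/6.2.-.-> }
  ?class rdfs:subClassOf ?node_uri .
  ?item chembl:hasTarget ?class .
}
"""
print(unbound_select_vars(query))
```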

danidi commented 7 years ago

If I see this correctly, the ?node was introduced in this change https://github.com/openphacts/OPS_LinkedDataApi/commit/3f8f05d71c763dc0457bb6eeef17f82b54f42a0a#diff-bb99944febec6798782a86256658385a in the 03_10_targetTreePharma.ttl file. The changes were done for the results of the ChEMBL classification, but maybe they affected the enzyme query somehow?

randykerber commented 7 years ago

The IMS call and response for this LDA command are as follows:

Basically, the only mapping that IMS finds for <http://purl.uniprot.org/enzyme/6.2.-.-> is to itself.

Is this what it should return?

Request (split into lines and url-encoding removed for readability) :

  http://alpha.openphacts.org:3004/QueryExpander/mapUriRDF?rdfFormat=RDF/XML
& targetUriPattern=http://rdf.ebi.ac.uk/resource/chembl/protclass/
& targetUriPattern=http://purl.obolibrary.org/obo/CHEBI_
& targetUriPattern=http://purl.uniprot.org/enzyme/
& targetUriPattern=http://purl.obolibrary.org/obo/GO_
& targetUriPattern=http://www.bioassayontology.org/bao#BAO_
& targetUriPattern=http://purl.obolibrary.org/obo/DOID_
& overridePredicateURI=http://www.w3.org/2004/02/skos/core#exactMatch
& lensUri=Default
& Uri=http://purl.uniprot.org/enzyme/6.2.-.-

Response:

<rdf:RDF
    xmlns:ops="http://no/BaseURI/Set/"
    xmlns:void="http://rdfs.org/ns/void#"
    xmlns:dul="http://www.ontologydesignpatterns.org/ont/dul/DUL.owl#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<rdf:Description rdf:about="http://purl.uniprot.org/enzyme/6.2.-.-">
    <exactMatch xmlns="http://www.w3.org/2004/02/skos/core#" rdf:resource="http://purl.uniprot.org/enzyme/6.2.-.-"/>
</rdf:Description>

</rdf:RDF>
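
To check a response like this programmatically, the following sketch parses the RDF/XML with Python's standard library and lists every skos:exactMatch pair. The helper name `exact_matches` is made up for illustration, and the sample is the response above with the unused namespace declarations trimmed.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def exact_matches(rdf_xml):
    """Return (subject, object) URI pairs for every skos:exactMatch in the document."""
    root = ET.fromstring(rdf_xml)
    pairs = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for match in description.findall(f"{{{SKOS_NS}}}exactMatch"):
            pairs.append((subject, match.get(f"{{{RDF_NS}}}resource")))
    return pairs


response = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://purl.uniprot.org/enzyme/6.2.-.-">
    <exactMatch xmlns="http://www.w3.org/2004/02/skos/core#" rdf:resource="http://purl.uniprot.org/enzyme/6.2.-.-"/>
</rdf:Description>
</rdf:RDF>"""

pairs = exact_matches(response)
# True when every mapping points back at the URI that was asked about.
only_self_mappings = all(subject == target for subject, target in pairs)
print(pairs, only_self_mappings)
```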

randykerber commented 7 years ago

Here's the SPARQL query. It's failing because it's timing out.

PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo_goa: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX cheminf: <http://semanticscience.org/resource/>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo_goa: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX cheminf: <http://semanticscience.org/resource/>
SELECT DISTINCT ?item ?chembl_compound WHERE {
{ SELECT DISTINCT ?chembl_target ?chembl_compound ?ocrs_compound ?node_uri ?assay_uri ?item ?target_type ?issue ?dup ?conf_score WHERE {
 ?chembl_target chembl:hasProteinClassification|chembl:hasTargetComponent/chembl:targetCmptXref|chembl:hasTargetComponent/skos:exactMatch/obo_goa:C|chembl:hasTargetComponent/skos:exactMatch/obo_goa:F|chembl:hasTargetComponent/skos:exactMatch/obo_goa:P ?class .
 VALUES ?node_uri { <http://purl.uniprot.org/enzyme/6.2.-.->  } GRAPH ?g {
 ?class rdfs:subClassOf ?node_uri .
 }
 GRAPH <http://www.ebi.ac.uk/chembl> {
 ?item a chembl:Activity ;
 chembl:hasAssay ?assay_uri ;
 chembl:hasMolecule ?chembl_compound .
 ?assay_uri chembl:hasTarget ?chembl_target .
 ?chembl_target a ?target_type .
 OPTIONAL {
 GRAPH <http://ops.rsc.org> {
 ?ocrs_compound skos:exactMatch ?chembl_compound .
 }
 }
 OPTIONAL { ?item chembl:dataValidityIssue ?issue_tmp }
 BIND (IF (BOUND(?issue_tmp) , true, false) AS ?issue)
 OPTIONAL { ?item chembl:potentialDuplicate ?dup_tmp }
 BIND (IF (BOUND(?dup_tmp) , true, false) AS ?dup)
 OPTIONAL { ?assay_uri chembl:targetConfScore ?conf_score_tmp }
 BIND (IF (BOUND(?conf_score_tmp) , ?conf_score_tmp, 0) AS ?conf_score)
 }
} }
GRAPH <http://www.ebi.ac.uk/chembl> {
 { ?assay_uri chembl:organismName ?assay_organism }
 UNION { ?assay_uri dcterms:description ?assay_description }
 UNION { ?assay_uri chembl:assayTestType ?assay_type }
 UNION { ?assay_uri chembl:targetConfDesc ?conf_desc }
 UNION { ?assay_uri chembl:targetRelType ?rel_type ;
 chembl:targetRelDesc ?rel_desc }
 UNION { ?chembl_target dcterms:title ?target_name }
 UNION { ?chembl_target chembl:organismName ?target_organism }
 UNION { ?chembl_target chembl:hasTargetComponent ?protein .
 OPTIONAL {
 GRAPH <http://www.conceptwiki.org> {
 ?cw_target skos:exactMatch ?protein ;
 skos:prefLabel ?protein_name
 }
 }
 }
 UNION { ?item chembl:publishedType ?published_type }
 UNION { ?item chembl:publishedRelation ?published_relation }
 UNION { ?item chembl:publishedValue ?published_value }
 UNION { ?item chembl:publishedUnits ?published_unit }
 UNION { ?item chembl:standardType ?activity_type }
 UNION { ?item chembl:standardRelation ?activity_relation }
 UNION { ?item chembl:standardValue ?std_value .
 BIND (xsd:decimal(?std_value) as ?activity_value)
 }
 UNION { ?item chembl:standardUnits ?activity_unit }
 UNION { ?item chembl:hasQUDT ?qudt_uri }
 UNION { ?item chembl:pChembl ?pChembl }
 UNION { ?item chembl:activityComment ?act_comment }
 UNION { ?item chembl:hasDocument ?document .
 { ?document owl:sameAs ?doi }
 UNION { ?document bibo:pmid ?pmid }
 }
 UNION { ?item chembl:dataValidityComment ?comment}  UNION {
 GRAPH <http://ops.rsc.org> {
 { ?ocrs_compound cheminf:CHEMINF_000396 ?inchi }
 UNION {?ocrs_compound cheminf:CHEMINF_000399 ?inchi_key}
 UNION {?ocrs_compound cheminf:CHEMINF_000018 ?smiles }
 }
 }
}
 } ORDER BY ?item  LIMIT 10 OFFSET 0

ianwdunlop commented 7 years ago

There may be some data missing or in the wrong graph. If you break the query down a bit then you can see that there is something strange happening:

PREFIX ops: <http://www.openphacts.org/api#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
SELECT DISTINCT ?chembl_target  WHERE {
  {
    VALUES ?node_uri {
      <http://purl.uniprot.org/enzyme/6.2.-.->  
    } GRAPH ?g {
      ?class rdfs:subClassOf ?node_uri
    }
    GRAPH <http://www.ebi.ac.uk/chembl> {
      {
        ?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class  
      }
    }
  }
} LIMIT 10

This query returns no results. So there appears to be no link between chembl target and subclasses of enzyme <http://purl.uniprot.org/enzyme/6.2.-.->. However if you remove the clause on the GRAPH then you do get some results:

PREFIX ops: <http://www.openphacts.org/api#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
SELECT DISTINCT ?chembl_target  WHERE {
  {
    VALUES ?node_uri {
      <http://purl.uniprot.org/enzyme/6.2.-.->  
    } GRAPH ?g {
      ?class rdfs:subClassOf ?node_uri
    }
    {
      ?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class  
    }
  }
} LIMIT 10

You can do something similar by counting the number of URIs that have the enzyme bit:

PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
SELECT (COUNT(?class) as ?class_count) WHERE {
#  GRAPH <http://www.ebi.ac.uk/chembl> {
    ?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class  
    FILTER ( strstarts(str(?class), "http://purl.uniprot.org/enzyme/") )
#  }
}

Here it is 5000+. If you uncomment the GRAPH bit then you get 0.

The usual RDF & SPARQL blindness rules apply here. Please try for yourself :)
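
For anyone who wants to repeat the comparison without editing comments by hand, here is a small sketch that generates the count query with or without the GRAPH wrapper. The function name `enzyme_xref_count_query` is an assumption for illustration; the generated text is meant to mirror the count query above and still has to be run against the endpoint yourself.

```python
from typing import Optional


def enzyme_xref_count_query(graph_uri: Optional[str] = None) -> str:
    """Build the enzyme cross-reference count query, optionally restricted to one named graph."""
    pattern = (
        "    ?chembl_target chembl:hasTargetComponent/chembl:targetCmptXref ?class\n"
        '    FILTER ( strstarts(str(?class), "http://purl.uniprot.org/enzyme/") )\n'
    )
    if graph_uri is not None:
        # Wrap the same pattern in a GRAPH clause to restrict it to one named graph.
        pattern = f"  GRAPH <{graph_uri}> {{\n{pattern}  }}\n"
    return (
        "PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>\n"
        "SELECT (COUNT(?class) AS ?class_count) WHERE {\n"
        f"{pattern}"
        "}"
    )


any_graph_query = enzyme_xref_count_query()
chembl_graph_query = enzyme_xref_count_query("http://www.ebi.ac.uk/chembl")
print(any_graph_query)
print(chembl_graph_query)
```

If the two counts differ as described above, the cross-reference triples are probably loaded into a graph other than <http://www.ebi.ac.uk/chembl>.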

ianwdunlop commented 7 years ago

As far as I can figure out, it is all the 'filterable' values that are causing the problem, i.e. all the items in the second GRAPH clause. If you leave this section out, the query finishes in an almost acceptable time. With them in, it times out.

ianwdunlop commented 7 years ago

Now this next statement is not scientifically proven but it seems that for every filter you add the query takes an extra ~50 seconds to run. The first one

    {
      ?assay_uri chembl:organismName ?assay_organism 
    }

doesn't seem to add much query time but the UNION statements after it do

    UNION {
      ?assay_uri dcterms:description ?assay_description 
    } 

randykerber commented 7 years ago

The SPARQL query makes no sense to me. Hard to say it's wrong without knowing what it's trying to accomplish.

It would seem meant to produce a set of ?item ?chembl_compound pairs.

But the entire second half of the query (starting at GRAPH <http://www.ebi.ac.uk/chembl> {) is one big UNION statement. It appears to serve no purpose other than to make the query time out. None of the object values of those union triples appear in the first half of the query, so they would seem to have no effect on the actual answer produced.

Maybe this query was produced by trying to copy-n-paste from somewhere else and something was botched? I dunno. But to be able to fix it, would need to know what it is supposed to be doing. Should it just be returning ?item and ?chembl_compound? Is it supposed to be returning a bunch of information?

danidi commented 7 years ago

The response from the target class pharmacology query should be similar to the target pharmacology pages API call, but retrieve the information for all targets that are in a specific target class. So for several targets it would retrieve bioactivity information with several compounds, together with information such as the assay description.

ianwdunlop commented 7 years ago

After some more digging around we discovered that this call actually sends 2 queries. The first one from above finds out what chembl activities/compounds are involved by applying all the user supplied filters. Then another one fetches the data for each of them. The filters take forever to run. Maybe we can remove them. The user will still get back all the info about the hierarchy just not the filtered version.
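
The two-step shape described here can be sketched roughly as follows: the first query yields a page of ?item URIs, and a second query is built around a VALUES clause holding just those URIs. This is an illustration under assumptions, not the query LDA actually emits: the helper names, the example.org activity URIs, and the two properties fetched were picked only to show the structure.

```python
def values_clause(variable, uris):
    """Render a SPARQL VALUES clause binding one variable to a list of URIs."""
    joined = " ".join(f"<{uri}>" for uri in uris)
    return f"VALUES ?{variable} {{ {joined} }}"


def detail_query(item_uris):
    """Build a follow-up query that fetches data only for the items found in step one."""
    return (
        "PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>\n"
        "SELECT ?item ?activity_type ?pChembl WHERE {\n"
        f"  {values_clause('item', item_uris)}\n"
        "  GRAPH <http://www.ebi.ac.uk/chembl> {\n"
        "    OPTIONAL { ?item chembl:standardType ?activity_type }\n"
        "    OPTIONAL { ?item chembl:pChembl ?pChembl }\n"
        "  }\n"
        "}"
    )


# Hypothetical item URIs standing in for the results of the first (selection) query.
page_of_items = ["http://example.org/activity/1", "http://example.org/activity/2"]
print(detail_query(page_of_items))
```

Because the second query is bound to a fixed page of items, its cost depends on the page size and not on the size of the whole target class.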

randykerber commented 7 years ago

The default query now appears to work on alpha:

randykerber commented 7 years ago

http://alpha.openphacts.org:3002/target/tree/pharmacology/pages?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-

randykerber commented 7 years ago

Can someone scan the results and see if it looks right?

Plus check any other test calls of /target/tree/pharmacology/pages.