openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

/target/members/count counts more data than /target/members/pages shows #394

Open jakhag opened 7 years ago

jakhag commented 7 years ago

For http://purl.uniprot.org/enzyme/6.2.-.- the count is 1371:

http://alpha.openphacts.org:3002/target/members/count?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1

But the target class members list does not return that many results. It fails at page 2:

http://alpha.openphacts.org:3002/target/members/pages?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_page=1&_pageSize=500

http://alpha.openphacts.org:3002/target/members/pages?uri=http%3A%2F%2Fpurl.uniprot.org%2Fenzyme%2F6.2.-.-&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1&_page=2&_pageSize=500
404: page not found
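As a quick sanity check on the numbers above, here is a minimal Python sketch of how many pages a count of 1371 implies at a page size of 500, and how the page URLs are put together. The helper names `expected_pages` and `page_url` and the placeholder credentials are made up for illustration; only the endpoint and parameter names come from the URLs in this report.

```python
import math
from urllib.parse import urlencode

# Endpoint taken from the URLs quoted in this report.
BASE = "http://alpha.openphacts.org:3002/target/members/pages"


def expected_pages(total: int, page_size: int) -> int:
    """Number of pages needed to list `total` members (hypothetical helper)."""
    return math.ceil(total / page_size)


def page_url(uri: str, page: int, page_size: int, app_id: str, app_key: str) -> str:
    """Build a request URL for one page of members (hypothetical helper)."""
    params = {
        "uri": uri,
        "app_id": app_id,
        "app_key": app_key,
        "_page": page,
        "_pageSize": page_size,
    }
    return f"{BASE}?{urlencode(params)}"


pages = expected_pages(1371, 500)
print(pages)  # 3

# Placeholder credentials, not real ones.
url = page_url("http://purl.uniprot.org/enzyme/6.2.-.-", 2, 500, "YOUR_APP_ID", "YOUR_APP_KEY")
print(url)
```

If the count and the pages call agreed, pages 1 to 3 would all be expected to return data, so a 404 at page 2 suggests the two calls select different sets of members.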

ianwdunlop commented 7 years ago

Looking at the SPARQL for the pages query, it seems that the members have no dcterms:title. They do have an rdfs:label, though, so maybe we should use that instead. But is that correct? When I changed the query to use rdfs:label I got the following (abridged to save space). Note that not all of the items have info attached. Is this to be expected, or is something else going on? It is possible that some expected data is also missing.

<items>
<item href="http://purl.uniprot.org/uniprot/E0TXE1"/>
<item href="http://purl.uniprot.org/uniprot/E1UV19"/>
<item href="http://purl.uniprot.org/uniprot/E3E2E2"/>
<item href="http://purl.uniprot.org/uniprot/E3UUE6"/>
<item href="http://purl.uniprot.org/uniprot/E7FHP1"/>
<item href="http://purl.uniprot.org/uniprot/O14975">
  <prefLabel>Very long-chain acyl-CoA synthetase</prefLabel>
  <exactMatch href="http://rdf.ebi.ac.uk/resource/chembl/target/CHEMBL4326">
    <prefLabel>Fatty acid transport protein 2</prefLabel>
    <type href="http://rdf.ebi.ac.uk/terms/chembl#SingleProtein"/>
    <inDataset href="http://www.ebi.ac.uk/chembl"/>
    <target_organism>Homo sapiens</target_organism>
  </exactMatch>
  <inDataset href="http://purl.uniprot.org"/>
  <target_organism_uri href="http://purl.uniprot.org/taxonomy/9606"/>
</item>
<item href="http://purl.uniprot.org/uniprot/O22898"/>
</items>
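
To see which items in a response like this came back without any info, something along these lines could be used. It is only a sketch: the sample XML is a cut-down version of the abridged output above, and `items_missing_label` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from typing import List

# Cut-down version of the abridged response shown above.
SAMPLE = """<items>
<item href="http://purl.uniprot.org/uniprot/E0TXE1"/>
<item href="http://purl.uniprot.org/uniprot/O14975">
  <prefLabel>Very long-chain acyl-CoA synthetase</prefLabel>
  <inDataset href="http://purl.uniprot.org"/>
</item>
<item href="http://purl.uniprot.org/uniprot/O22898"/>
</items>"""


def items_missing_label(xml_text: str) -> List[str]:
    """Return the href of every <item> that has no direct <prefLabel> child."""
    root = ET.fromstring(xml_text)
    return [
        item.get("href")
        for item in root.findall("item")
        if item.find("prefLabel") is None
    ]


missing = items_missing_label(SAMPLE)
print(missing)
```

Only direct children are checked, so a prefLabel nested inside an exactMatch does not count as a label for the item itself.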

ianwdunlop commented 7 years ago

Here is the SPARQL query. I changed it to look for ?item dcterms:title|rdfs:label ?chembl_name, i.e. either dcterms:title or rdfs:label. BTW, this API call is one of those two-part ones where it first finds all the items and then gets the properties in a different call. Not really sure whether it needs those OPTIONAL blocks or not.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo: <http://www.semantic-systems-biology.org/ontology/rdf/OBO#>
PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX chembl: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX goa: <http://www.semantic-systems-biology.org/ontology/rdf/GOA#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?item WHERE {
  VALUES ?g { <http://purl.uniprot.org/enzyme/inference> <http://www.ebi.ac.uk/chembl/target/inference> <http://www.geneontology.org/inference> }
  VALUES ?node_uri { <http://purl.uniprot.org/enzyme/6.2.-.-> }
  GRAPH ?g {
    ?child_node rdfs:subClassOf ?node_uri .
    FILTER ( isURI(?child_node) )
  }
  { ?item obo:C ?child_node .
    ?item uniprot:reviewed true }
  UNION
  { ?item obo:F ?child_node .
    ?item uniprot:reviewed true }
  UNION
  { ?item obo:P ?child_node .
    ?item uniprot:reviewed true }
  UNION
  { ?item uniprot:enzyme|uniprot:domain/uniprot:enzyme|chembl:hasProteinClassification ?child_node }
  VALUES ?g2 { <http://purl.uniprot.org> <http://www.ebi.ac.uk/chembl> <http://www.openphacts.org/goa> }
  GRAPH ?g2 {
    ?item [] []
  }
  { ?item dcterms:title|rdfs:label ?chembl_name
    FILTER ( ?chembl_name != '' ) }
  UNION
  { ?item goa:description ?uniprot_name
    FILTER ( ?uniprot_name != '' ) }
  OPTIONAL {
    { ?mapping skos:relatedMatch/skos:exactMatch ?item }
    UNION
    { ?item skos:relatedMatch/skos:exactMatch ?mapping }
    MINUS { ?mapping a chembl:ProteinComplexGroup }
    { ?mapping goa:description ?mapping_name }
    UNION
    { ?mapping dcterms:title ?mapping_name }
    FILTER ( ?mapping_name != '' )
    { ?mapping uniprot:organism ?mapping_org_uri }
    UNION
    { ?mapping chembl:organismName ?mapping_org
      GRAPH <http://www.ebi.ac.uk/chembl> {
        ?mapping a ?mapping_type
        FILTER ( ?mapping_type != chembl:UniprotRef )
      }
    }
    BIND(IF(BOUND(?mapping_org), <http://www.ebi.ac.uk/chembl>, <http://purl.uniprot.org>) AS ?mapping_dataset)
  }
  OPTIONAL {
    ?item uniprot:organism ?uniprot_organism
    BIND (?item AS ?uniprot_target)
  }
  OPTIONAL {
    GRAPH <http://www.ebi.ac.uk/chembl> {
      ?item a ?target_type
    }
  }
  OPTIONAL {
    ?item chembl:organismName ?chembl_organism
    BIND (?item AS ?chembl_target)
  }
}
ORDER BY ?item LIMIT 500 OFFSET 500
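
The LIMIT 500 OFFSET 500 at the end of this query lines up with the failing request (_page=2, _pageSize=500). A minimal sketch of that mapping, assuming 1-based page numbers and an offset of (page - 1) * page size; `limit_offset` is a hypothetical helper, not the API's actual code:

```python
def limit_offset(page: int, page_size: int) -> str:
    """Translate 1-based paging parameters into a SPARQL LIMIT/OFFSET clause.

    Assumes offset = (page - 1) * page_size, which matches the query above
    for _page=2 and _pageSize=500.
    """
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return f"LIMIT {page_size} OFFSET {(page - 1) * page_size}"


print(limit_offset(1, 500))  # LIMIT 500 OFFSET 0
print(limit_offset(2, 500))  # LIMIT 500 OFFSET 500
```

Under that assumption, page 2 comes back empty whenever fewer than 501 items satisfy the whole pattern, which is what would happen if the title requirement filters out most of the members that the count call includes.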

danidi commented 7 years ago

I would expect a prefLabel from Uniprot for each of the items, but not necessarily from ChEMBL.

ianwdunlop commented 7 years ago

With the OPTIONAL blocks removed the query takes about 0.6 seconds, compared to 1.8 seconds with them in place.

danidi commented 7 years ago

The organism and organism name are there as filters for the query. Do they make the query slower even if no filter parameter is set by the user?

ianwdunlop commented 7 years ago

Ok, thanks @danidi. So the optionals are for the filters. They do make the query slower if no filters are set, but there is no real way to avoid that without a bit of a code re-write. It's not massively slower though.
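
For what it's worth, the kind of re-write mentioned here could look roughly like the sketch below: only add the organism OPTIONAL blocks to the query text when the caller actually set a filter. This is purely illustrative and not how the API code is written; `organism_clauses` and its `organism_filter` parameter are hypothetical, and the FILTER clause that would apply the filter value is left out.

```python
from typing import Optional

# The two organism OPTIONAL blocks, copied from the query earlier in this thread.
ORGANISM_BLOCKS = """OPTIONAL { ?item uniprot:organism ?uniprot_organism
 BIND (?item AS ?uniprot_target) }
OPTIONAL { ?item chembl:organismName ?chembl_organism
 BIND (?item AS ?chembl_target) }"""


def organism_clauses(organism_filter: Optional[str]) -> str:
    """Return the organism OPTIONAL blocks only when a filter was requested.

    `organism_filter` is a hypothetical parameter standing in for whatever
    filter value the user passed; an empty or missing value means no filter.
    """
    return ORGANISM_BLOCKS if organism_filter else ""


print(repr(organism_clauses(None)))      # ''
print(organism_clauses("Homo sapiens"))  # prints both OPTIONAL blocks
```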