openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

/disease/assoc/byTarget API command fails in 2.2 #384

Closed randykerber closed 7 years ago

randykerber commented 7 years ago

The following API command works in 2.1 but fails in 2.2

For 2.1 (host = beta):

https://beta.openphacts.org/2.1/disease/assoc/byTarget?uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ9Y5Y9&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1

For 2.2 (host = alpha):

http://alpha.openphacts.org:3002/disease/assoc/byTarget?uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ9Y5Y9
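For anyone reproducing the comparison, the two request URLs above can be built from the same target URI so that only the host and version prefix differ. This is a minimal sketch: the helper name `build_by_target_url` and the placeholder credentials are assumptions for illustration, not part of the Open PHACTS API client.

```python
from urllib.parse import urlencode

# Target used throughout this issue.
TARGET_URI = "http://purl.uniprot.org/uniprot/Q9Y5Y9"


def build_by_target_url(base, target_uri, app_id=None, app_key=None):
    """Build a /disease/assoc/byTarget request URL (hypothetical helper).

    `base` is the host plus any version prefix, for example
    "https://beta.openphacts.org/2.1" or "http://alpha.openphacts.org:3002".
    """
    params = {"uri": target_uri}
    # The beta (2.1) URL above carries credentials; the alpha (2.2) one does not.
    if app_id is not None:
        params["app_id"] = app_id
    if app_key is not None:
        params["app_key"] = app_key
    return f"{base}/disease/assoc/byTarget?{urlencode(params)}"


# 2.2 on alpha, no credentials needed according to the URL above.
alpha_url = build_by_target_url("http://alpha.openphacts.org:3002", TARGET_URI)

# 2.1 on beta; substitute your own app_id / app_key for the placeholders.
beta_url = build_by_target_url(
    "https://beta.openphacts.org/2.1", TARGET_URI,
    app_id="YOUR_APP_ID", app_key="YOUR_APP_KEY",
)
```

Fetching both URLs and comparing the response bodies then shows whether the two versions agree for this target.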

randykerber commented 7 years ago

More extensive writeup here: https://github.com/openphacts/GLOBAL/blob/master/issues/384-disease-by-target/disease-assoc-target.md

randykerber commented 7 years ago

Root cause seems to be missing Ensembl linkset(s).

The version 2.1 Ensembl-Human Linksets are here: https://data.openphacts.org/free/2.1/ims/linksets/data/ops-ensembl-homosapiens-linksets/

The version 2.2 versions come from here: http://www.bridgedb.org/data/linksets/current/HomoSapiens/

This particular bug seems to be caused by absence of the file named Ensembl_Hs_ncbigene.direct.LS.ttl.gz.

But there are several files included in 2.1 for which there is no refreshed version (at least not yet).

Which of these Linkset files from 2.1 will have refreshed versions? Which should not be loaded anymore? For which should the old versions just be reused?

randykerber commented 7 years ago

This fails if using v2.2 IMS but appears to work if using v2.1 IMS. Result is same whether using v2.1 or v2.2 SPARQL. My initial hypothesis was that the source of error was that the linkset file Ensembl_Hs_ncbigene.direct.LS.ttl.gz from v2.1 is not present in v2.2. I don't remember the explanations and details.

randykerber commented 7 years ago

Here is the SPARQL query:

<DELETED wrong sparql query -- correct one included below>

randykerber commented 7 years ago

The IMS mapping results differ between v2.1 and v2.2.

version 2.1

http://beta.openphacts.org:3004/QueryExpander/mapUri?Uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ9Y5Y9&lensUri=http%3A%2F%2Fopenphacts.org%2Fspecs%2F%2FLens%2FDefault&Pattern+Filter=&overridePredicateURI=&format=text%2Fhtml

version 2.2

http://alpha.openphacts.org:3004/QueryExpander/mapUri?Uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ9Y5Y9&lensUri=http%3A%2F%2Fopenphacts.org%2Fspecs%2F%2FLens%2FDefault&Pattern+Filter=&overridePredicateURI=&format=text%2Fhtml

randykerber commented 7 years ago

The SPARQL query times out, exceeding the limit of 15 minutes.

danidi commented 7 years ago

@randykerber Is this really the correct SPARQL query for this issue? It should be disease associations by target, but the SPARQL query uses an enzyme URL and asks for pharmacology from ChEMBL.

Disease associations are available for genes. The IMS v2.1 mappings provide gene URIs, whereas the v2.2 mappings return only protein URIs.

randykerber commented 7 years ago

@danidi -- must have grabbed the wrong query out of the log. Re-ran it now, here's the SPARQL:

Returns no results. The VALUES() statement contains the URI: <http://www.openphacts.org/api#no_mappings_found>

PREFIX uniprot: <http://purl.uniprot.org/core/>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ncit: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT ?item WHERE {
  VALUES ?uniprot_target_uri { <http://purl.uniprot.org/uniprot/Q9Y5Y9> }
  GRAPH <http://purl.uniprot.org> {
    ?uniprot_target_uri uniprot:existence ?existence .
  }
  VALUES ?dg_gene_uri { <http://www.openphacts.org/api#no_mappings_found> }
  GRAPH <http://rdf.imim.es> {
    ?item sio:SIO_000628 ?dg_gene_uri .
    ?dg_gene_uri rdf:type ncit:C16612 ;
      void:inDataset ?geneDataset .
    ?item rdf:type ?type .
    ?type rdfs:label ?type_label .
    ?item sio:SIO_000253 ?primarySource .
    ?item sio:SIO_000628 ?disease .
    ?disease rdf:type ncit:C7057 ;
      foaf:name ?diseaseName ;
      void:inDataset ?diseaseDataset .
    ?item void:inDataset ?assocDataset .
    OPTIONAL {
      ?disease sio:SIO_000095 ?diseaseClass .
      ?diseaseClass foaf:name ?diseaseClassName ;
        void:inDataset ?diseaseClassDataset .
    }
    OPTIONAL { ?item dcterms:description ?description . }
    OPTIONAL { ?item sio:SIO_000772 ?pubmed_id . }
  }
  OPTIONAL {
    VALUES ?cw_target_uri {
      <http://www.conceptwiki.org/concept/index/00059958-a045-4581-9dc5-e5a08bb0c291>
      <http://www.conceptwiki.org/concept/00059958-a045-4581-9dc5-e5a08bb0c291>
    }
    GRAPH <http://www.conceptwiki.org> {
      ?cw_target_uri skos:prefLabel ?cw_prefLabel .
    }
  }
}
ORDER BY ?item
LIMIT 10 OFFSET 0

randykerber commented 7 years ago

When I re-run the query but use the v2.1 IMS on beta instead of the v2.2 IMS on alpha, the constructed query is the same except that the <http://www.openphacts.org/api#no_mappings_found> is replaced with <http://identifiers.org/ncbigene/6336>, and that SPARQL query returns 10 results.
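When comparing generated queries pulled from the logs, it can help to extract the URIs bound in a given VALUES clause and check for the `no_mappings_found` placeholder. The sketch below is only a log-inspection aid; the function names are made up for illustration, and the regular expression assumes the simple single-variable `VALUES ?var { <uri> ... }` form seen in the query above.

```python
import re

# Placeholder URI that appears when the IMS returns no mappings (from the query above).
NO_MAPPINGS = "http://www.openphacts.org/api#no_mappings_found"


def values_bindings(query, variable):
    """Return the URIs listed in the first `VALUES ?variable { ... }` clause."""
    pattern = r"VALUES\s+\?" + re.escape(variable) + r"\s*\{([^}]*)\}"
    match = re.search(pattern, query)
    if match is None:
        return []
    return re.findall(r"<([^>]+)>", match.group(1))


def has_no_mappings(query, variable):
    """True if the VALUES clause for `variable` holds only the placeholder URI problem case."""
    return NO_MAPPINGS in values_bindings(query, variable)


# Fragments of the two generated queries described in this thread.
query_v22 = (
    "VALUES ?dg_gene_uri { <http://www.openphacts.org/api#no_mappings_found> } "
    "GRAPH <http://rdf.imim.es> { ?item sio:SIO_000628 ?dg_gene_uri . }"
)
query_v21 = query_v22.replace(NO_MAPPINGS, "http://identifiers.org/ncbigene/6336")

print(has_no_mappings(query_v22, "dg_gene_uri"))  # v2.2 IMS on alpha
print(values_bindings(query_v21, "dg_gene_uri"))  # v2.1 IMS on beta
```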

randykerber commented 7 years ago

Adding link to a note created from an email from Daniela on this issue:

https://github.com/openphacts/GLOBAL/blob/master/issues/384-disease-by-target/disease-assoc-disease.md

randykerber commented 7 years ago

I think I have figured out the cause of this one.

In v2.1, the linkset file uniprot_geneid.ttl.gz contains this link: ns1:Q9Y5Y9 rdfs:seeAlso <http://purl.uniprot.org/geneid/6336> . where "ns1:" maps to <http://purl.uniprot.org/uniprot/>. In the v2.2 version of uniprot_geneid.ttl.gz there is no mapping for uniprot Q9Y5Y9.

The uniprot linksets are created with SPARQL queries run against the uniprot data in Virtuoso. But if I run a SPARQL query on the v2.2 SPARQL endpoint, the mapping from uniprot/Q9Y5Y9 to ncbigene/6336 is in there! So why isn't it in the "uniprot_geneid" linkset file?

If I go into the directory where the uniprot linkset files are saved and check the number of lines in each file, I get the following:

alpha:~/d/era/b/staging/links/uniprot$ wc -l *.nt
   10001 uniprot_ensembl.nt
    6149 uniprot_flybase.nt
   10001 uniprot_geneid.nt
   10001 uniprot_mgi.nt
   10001 uniprot_omim.nt
   10001 uniprot_pdb.nt
   10001 uniprot_refseq.nt
    7912 uniprot_rgd.nt
    6740 uniprot_sgd.nt
   10001 uniprot_unigene.nt
    5781 uniprot_wormbase.nt
    2840 uniprot_zfin.nt

Gee, what a coincidence -- 7 files all have exactly 10001 lines. How can that be?? A Virtuoso parameter has restricted the number of results returned to 10000. The rest never make it into the linkset file.
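A check like this could be automated before linksets are loaded. The following is a rough sketch, assuming the cap is 10000 results and that a capped export shows up as a file of exactly 10000 or 10001 lines (the listing above shows 10001; why there is one extra line is not established here). The function names are illustrative only.

```python
from pathlib import Path

RESULT_LIMIT = 10000  # assumed result cap, per the explanation above


def count_lines(path):
    """Count lines in a file without loading it all into memory."""
    with open(path, "rb") as fh:
        return sum(1 for _ in fh)


def line_counts(directory, pattern="*.nt"):
    """Map file name -> line count for every matching file in `directory`."""
    return {p.name: count_lines(p) for p in sorted(Path(directory).glob(pattern))}


def suspicious(counts, limit=RESULT_LIMIT):
    """Names of files whose line count sits exactly at the cap (or cap + 1)."""
    return sorted(name for name, n in counts.items() if n in (limit, limit + 1))


# Line counts copied from the `wc -l` listing above.
observed = {
    "uniprot_ensembl.nt": 10001, "uniprot_flybase.nt": 6149,
    "uniprot_geneid.nt": 10001, "uniprot_mgi.nt": 10001,
    "uniprot_omim.nt": 10001, "uniprot_pdb.nt": 10001,
    "uniprot_refseq.nt": 10001, "uniprot_rgd.nt": 7912,
    "uniprot_sgd.nt": 6740, "uniprot_unigene.nt": 10001,
    "uniprot_wormbase.nt": 5781, "uniprot_zfin.nt": 2840,
}

for name in suspicious(observed):
    print(f"possibly truncated: {name}")
```

On a real staging directory, `suspicious(line_counts("path/to/links/uniprot"))` would give the same kind of list; a file landing exactly on the cap is a hint rather than proof of truncation.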

randykerber commented 7 years ago

The test query now appears to be working.

Can someone look at the result and see if it looks like a decent answer?

ianwdunlop commented 7 years ago

Those answers look the same to me in alpha and beta. Sounds like a fix :)

ianwdunlop commented 7 years ago

I guess that means all those linksets have to be regenerated :( I hit that same 10000 result limit with ops-search a while back and had to change it. Caught me out as well.

randykerber commented 7 years ago

@ianwdunlop -- As far as I can tell, the 12 new Uniprot linksets (with all triples, not just first 10,000) appear to be in IMS on alpha. At least there weren't any error messages.

On the other hand, the Ensembl Mouse linksets did not load. Have not yet investigated why.

randykerber commented 7 years ago

If "same answer in 2.2 as in 2.1" means it's a "fix", then it's a fix.