Closed danidi closed 9 years ago
I think this is the same "IMS upon integration vs. IMS upon query" issue we have seen before. When you query for the genes in a pathway, the genes are mapped to the identifier scheme you ask for (as they should be). When you query with a specific gene, that apparently does not happen, so if the gene in the pathway was annotated with another type of ID scheme the pathway isn't found.
Daniela and I looked into this a little and found that querying the WP endpoint with the gene PTK2B (http://identifiers.org/ncbigene/2185) gave a list of pathways (list at end of message) that DID NOT include WP49, although the gene is found in the pathway. Similarly, querying the WP SPARQL endpoint with JAK3 (http://identifiers.org/ncbigene/3718) returns no pathways, although this gene is found in WP49 (IL-2 signaling pathway). @Andra can you investigate this a little in the WP RDF?
List of pathways for PTK2B:
http://rdf.wikipathways.org/Pathway/WP794_r67074 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP978_r67181 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP2263_r67609 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP973_r67373 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP855_r67372 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1025_r67068 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1091_r67366 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1144_r67069 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP908_r67075 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP747_r67371 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1096_r68551 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP313_r69027 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP313_r69027/group/c1e29 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP860_r69215 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP752_r68850 "PTK2B"@en
I fear that having a different backpagehead value than the protein name does not (fully) explain this issue. I tried with 6772 (Stat1), but I can't find the IL-2 signaling pathway with this either (tested with the OPS API). Rather, it seems that a whole pathway can't be found via any of its targets, while the targets can find other pathways.
Here is the list I got from the initial support ticket (Gene IDs with expected pathways). Maybe the pathways have something in common?
41 DNA_Replication
6850 B_Cell_Receptor_Signaling_Pathway IL-2_Signaling_pathway IL-3_Signaling_Pathway IL-5_signaling_pathway RANKL/RANK_Signaling_Pathway Regulation_of_toll-like_receptor_signaling_pathway
3718 IL-2_Signaling_pathway IL-4_signaling_pathway IL-7_signaling_pathway IL-9_signaling_pathway
695 B_Cell_Receptor_Signaling_Pathway IL-5_signaling_pathway Kit_receptor_signaling_pathway Regulation_of_toll-like_receptor_signaling_pathway
10498 Androgen_receptor_signaling_pathway
11035 TNF_alpha_Signaling_Pathway
10987 TGF_beta_Signaling_Pathway
1457 TNF_alpha_Signaling_Pathway
7329 Androgen_receptor_signaling_pathway TGF_beta_Signaling_Pathway
92 B_Cell_Receptor_Signaling_Pathway
23028 Androgen_receptor_signaling_pathway
4353 Folate_Metabolism Selenium_Pathway Vitamin_B12_Metabolism
1509 Prolactin_Signaling_Pathway
23410 Energy_Metabolism
From the OPS API.
Hi, I can't reproduce the issue on the RDF side. First I used the following SPARQL query to verify that http://identifiers.org/ncbigene/2185 is part of WP49. It is.
SELECT DISTINCT ?gpIdentifier WHERE { ?pathway ?p <http://identifiers.org/wikipathways/WP49> . ?gp dcterms:isPartOf ?pathway . ?gp a gpml:DataNode . ?gp dc:identifier ?gpIdentifier}
The other way around asking for pathways with http://identifiers.org/ncbigene/2185 also returns WP49 as a pathway containing http://identifiers.org/ncbigene/2185.
SELECT DISTINCT ?ioWP WHERE { ?pathway dc:identifier ?ioWP . ?gp dcterms:isPartOf ?pathway . ?gp a gpml:DataNode . ?gp dc:identifier <http://identifiers.org/ncbigene/2185>}
Both queries can be executed on http://sparql.wikipathways.org
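As a rough sketch of how the second query could be run outside the web form, the snippet below builds it for an arbitrary gene URI and sends it over the standard SPARQL protocol using only the Python standard library. The prefix IRIs, the function names and the endpoint URL supplied by the caller are assumptions made for illustration; the endpoint may already predefine these prefixes, and its exact query URL is not given in this thread.

```python
import json
import urllib.parse
import urllib.request

# Assumed prefix IRIs; check them against the endpoint's own namespace list.
PREFIXES = """\
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX gpml: <http://vocabularies.wikipathways.org/gpml#>
"""


def pathways_for_gene_query(gene_uri: str) -> str:
    """Build the 'which pathways contain this gene' query shown above."""
    # Reject input that cannot be written as a plain <IRI> in SPARQL.
    if any(c in gene_uri for c in '<>" '):
        raise ValueError("not a plain IRI: %r" % gene_uri)
    return PREFIXES + (
        "SELECT DISTINCT ?ioWP WHERE {\n"
        "  ?pathway dc:identifier ?ioWP .\n"
        "  ?gp dcterms:isPartOf ?pathway .\n"
        "  ?gp a gpml:DataNode .\n"
        "  ?gp dc:identifier <" + gene_uri + "> .\n"
        "}"
    )


def run_select(endpoint: str, query: str) -> list:
    """Send a SELECT query via HTTP GET and return the JSON result bindings."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["results"]["bindings"]
```

Calling `run_select(endpoint, pathways_for_gene_query("http://identifiers.org/ncbigene/3718"))` with the endpoint's query URL would repeat the check for JAK3.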
Doing the same for "http://identifiers.org/ncbigene/3718" does return the expected results.
I am stuck here because the IMS isn't relevant since http://identifiers.org/ncbigene/2185 is the original identifier in the pathway. There is no need for mappings.
@ChristineChichester could you send me the queries you used?
It could be that the RDF is not up-to-date in the platform.
Paul
(In reply to andrawaag's comment of Tue, May 20, 2014, 12:08 PM.)
Dr. Paul Groth (p.t.groth@vu.nl) http://www.few.vu.nl/~pgroth/ Assistant Professor
But if it is a version problem (e.g. the protein/pathway connection is not in the OPS RDF), then I wouldn't expect to find the target via the pathway. But this way the connection can be retrieved.
I just had an offline discussion with Christine on this. I ran into a similar problem when running a federated query between WikiPathways, UniProt and DisGeNET. Some links simply weren't made, although I was pretty sure the data was in.
It appears that there is an issue with the namespace for Entrez Gene. There are actually two, and they aren't mapped to each other (see: https://beta.openphacts.org/1.4/mapUri?app_id=18983b12&app_key=c99cf43da48a1a2f9069651fe6be7c06&Uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F2185). Both http://identifiers.org/ncbigene and http://identifiers.org/entrez.gene resolve to an entry in Entrez Gene. The namespace was updated somewhere around 2012, deprecating entrez.gene in favour of ncbigene.
In the WikiPathways RDF both namespaces are used. There are two predicates for an identifier: dc:identifier and wp:bdbEntrezGene. dc:identifier points to the original identifier added by the pathway curator; wp:bdbEntrezGene is a normalised identifier (mapped through BridgeDb). Since BridgeDb still uses the older (MIRIAM-based) namespace, we ended up with the entrez.gene-based URI.
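To illustrate the mismatch, here is a minimal sketch of the kind of client-side workaround one could apply before comparing identifiers: rewrite the older entrez.gene URIs to the ncbigene form so that values from both predicates line up. The function and constant names are hypothetical, and this is not how the IMS itself does the mapping.

```python
# The two identifiers.org namespaces for Entrez Gene discussed in this thread.
OLD_PREFIX = "http://identifiers.org/entrez.gene/"
NEW_PREFIX = "http://identifiers.org/ncbigene/"


def normalise_gene_uri(uri: str) -> str:
    """Rewrite an entrez.gene URI to its ncbigene equivalent; leave others as-is."""
    if uri.startswith(OLD_PREFIX):
        # The local ID is identical in both namespaces, so only the prefix changes.
        return NEW_PREFIX + uri[len(OLD_PREFIX):]
    return uri


def same_gene(uri_a: str, uri_b: str) -> bool:
    """Compare two gene URIs after normalising the namespace."""
    return normalise_gene_uri(uri_a) == normalise_gene_uri(uri_b)
```

With this, a dc:identifier value in the ncbigene form and a wp:bdbEntrezGene value in the entrez.gene form for the same gene compare as equal, which a plain string comparison would miss.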
To know whether or not this also causes the observed difference in recall, we need to know which predicate is used to capture the URI of a concept in the API calls.
I am working on a quick fix for the purpose of my federated query, but I am wondering if the more consistent solution would be to map http://identifiers.org/entrez.gene to http://identifiers.org/ncbigene in the IMS. In the end entrez.gene remains a resolvable namespace.
Yes, agree. I think the two should be mapped to each other. Is there a way to do that which benefits from the fact that, while the namespaces differ, they refer to the same resources and the IDs are identical?
For the sake of completeness: I talked to Egon and the issue with BridgeDb appears to be fixed already. That means that BridgeDb already contains the ncbigene namespace in its datasource table, so in our next releases the URIs in WPRDF will be consistent.
I'll add Christian to the ticket so we can get the http://identifiers.org/entrez.gene namespace added to the IMS
It appears that http://identifiers.org/entrez.gene/$id and http://identifiers.org/ncbigene/$id are actually the same thing, as http://identifiers.org/entrez.gene/100010 forwards to http://identifiers.org/ncbigene/100010.
That being the case, then yes, Christine's fix is the correct one, and it is also VERY easy to apply to older versions of the IMS.
Is this fix in use by the current API 1.4 version? If yes, it didn't solve the problem unfortunately.
Looks like the fix was done incorrectly. See http://openphacts.cs.man.ac.uk:9090/QueryExpander/dataSource/L and notice the incorrect pattern http://info.identifiers.org/entrez.gene/%24id/$id
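A mistake of this kind could be caught with a simple sanity check on each URI pattern before deployment. The sketch below is a generic illustration, not part of the IMS: the function name and the rules it applies are assumptions, and the actual config file format is not shown in this thread.

```python
def check_uri_pattern(pattern: str, placeholder: str = "$id") -> list:
    """Return a list of problems found in a URI pattern (empty if none)."""
    problems = []
    if not pattern.startswith(("http://", "https://")):
        problems.append("pattern is not an http(s) URI")
    if pattern.count(placeholder) != 1:
        problems.append("expected exactly one %s placeholder" % placeholder)
    # '%24' is a URL-encoded '$', a sign the placeholder was escaped by mistake.
    if "%24" in pattern:
        problems.append("pattern contains a URL-encoded '$' (%24)")
    return problems
```

Applied to the pattern above, the check reports the URL-encoded placeholder, while http://identifiers.org/entrez.gene/$id passes without complaints.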
I pushed a bug fix to https://github.com/openphacts/deployment/tree/master/IMSandExpander/Ops1.4.1.1
It is the exact same WAR, just with a different config file inside.
This is still an open issue for 1.4.
I can confirm the latest WPRDF only uses the http://identifiers.org/ncbigene/100010 pattern.
Fixed - the Pathway by Target calls were only looking for wp:GeneProduct; wp:Protein has now been added.
I also tested and it works.
I just checked... there is also wp:Complex, which is like the difference between Protein and Target in ChEMBL, but there are only very few such nodes in WikiPathways linked to a protein/gene identifier (basically N=1).
Yes, that is one of those things that can happen on a community-edited site like WikiPathways. The best way to solve things like that is to just go to WikiPathways and fix it there. Complexes normally should not have (single) protein IDs.
Basically we have three DataNode types that are supposed to have target information: gene product, RNA and protein. I understood protein has now been added to the API calls. For some reason RNA is not in the RDF. We are checking why that is the case. Possibly that data type was just never used in a pathway in the curated collection.
This brings up another issue. Basically the RDF converter should be created based on the data model, not on the actual GPML. Because some things might not be used now but could be used later. The same is actually true for generating API calls.
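One way to read this suggestion: keep the list of target-bearing node types in a single place derived from the data model, and generate the query fragment from it, so a type that is unused today is still covered once it appears. The sketch below assumes the class names wp:GeneProduct and wp:Protein from this thread plus a hypothetical wp:Rna spelling for the RNA type; the graph pattern is illustrative and is not the actual OPS API query.

```python
# Assumed list of node types that may carry target identifiers.
# "wp:Rna" is a guessed spelling for the RNA type mentioned above.
TARGET_NODE_TYPES = ("wp:GeneProduct", "wp:Protein", "wp:Rna")


def target_type_values(types=TARGET_NODE_TYPES) -> str:
    """Build a SPARQL VALUES clause restricting ?type to the given classes."""
    return "VALUES ?type { " + " ".join(types) + " }"


def pathways_by_target_pattern(target_uri: str, types=TARGET_NODE_TYPES) -> str:
    """Build an illustrative graph pattern that matches every listed node type."""
    return (
        target_type_values(types) + "\n"
        "?node a ?type ;\n"
        "      dc:identifier <" + target_uri + "> ;\n"
        "      dcterms:isPartOf ?pathway ."
    )
```

Adding a new node type then means changing one tuple rather than editing each API query by hand.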
From support portal http://support.openphacts.org/helpdesk/tickets/58.
It seems for some gene IDs you can't find a pathway with pathways/byTarget, although they were identified with a pathway/getTargets call (tested with 1.3 and 1.4).
For example: http://identifiers.org/wikipathways/WP49 finds, among others, the following genes, but none of them returns the pathway again.
http://identifiers.org/ensembl/ENSG00000174775 -> 404 error
http://identifiers.org/ncbigene/5605 -> returns several pathways, but not WP49
http://identifiers.org/ncbigene/2185 -> returns pathways, but not WP49
http://identifiers.org/ncbigene/207 -> returns several pathways, but not WP49
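The round-trip check described in this ticket can be sketched as a small helper that takes the two lookups as plain functions, so it can be pointed at any API version. The function name and the convention that a failed lookup raises LookupError are assumptions for illustration; wrapping the real pathway/getTargets and pathways/byTarget calls is left to the caller.

```python
from typing import Callable, Dict, Iterable


def round_trip_failures(
    pathway: str,
    get_targets: Callable[[str], Iterable[str]],
    pathways_by_target: Callable[[str], Iterable[str]],
) -> Dict[str, str]:
    """Return the targets of a pathway that do not lead back to that pathway."""
    failures = {}
    for target in get_targets(pathway):
        try:
            found = set(pathways_by_target(target))
        except LookupError as error:
            # Hypothetical convention: a 404-style miss is raised as LookupError.
            failures[target] = "lookup failed: %s" % error
            continue
        if pathway not in found:
            failures[target] = "pathway missing from %d result(s)" % len(found)
    return failures
```

An empty result means every target returned by the first call finds the pathway again; anything else lists the targets to investigate, together with the reason.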