openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

pathway finds target, but target doesn't find pathway #131

Closed danidi closed 9 years ago

danidi commented 10 years ago

From support portal http://support.openphacts.org/helpdesk/tickets/58.

It seems for some gene IDs you can't find a pathway with pathways/byTarget, although they were identified with a pathway/getTargets call (tested with 1.3 and 1.4).

For example, http://identifiers.org/wikipathways/WP49 finds, among others, the following genes, but none of them returns the pathway again:

http://identifiers.org/ensembl/ENSG00000174775 -> 404 error
http://identifiers.org/ncbigene/5605 -> returns several pathways, but not WP49
http://identifiers.org/ncbigene/2185 -> returns pathways, but not WP49
http://identifiers.org/ncbigene/207 -> returns several pathways, but not WP49
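The asymmetry described above can be checked mechanically. This is a minimal sketch, assuming the results of the two calls have already been collected into plain dictionaries; the function name `missing_reverse_links` and the sample values are illustrative only and are not part of the OPS API.

```python
def missing_reverse_links(targets_by_pathway, pathways_by_target):
    """Return (pathway, target) pairs found pathway -> target but not target -> pathway."""
    missing = []
    for pathway, targets in targets_by_pathway.items():
        for target in targets:
            # A target with no entry at all (e.g. a 404) counts as missing too.
            if pathway not in pathways_by_target.get(target, set()):
                missing.append((pathway, target))
    return sorted(missing)


# Illustrative values only, loosely modelled on the report above.
targets_by_pathway = {"WP49": {"ncbigene/2185", "ncbigene/207"}}
pathways_by_target = {"ncbigene/2185": {"WP794"}}

for pathway, target in missing_reverse_links(targets_by_pathway, pathways_by_target):
    print(pathway, "is not returned for", target)
```

Every pair it reports is a candidate for the kind of one-directional link described in this ticket.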

Chris-Evelo commented 10 years ago

I think this is the same "IMS upon integration vs. IMS upon query" issue we have seen before. When you query for the genes in a pathway, the genes are mapped to the identifier scheme you ask for (as they should be). When you query with a specific gene, that mapping apparently does not happen, so if the gene in the pathway was annotated with another type of ID scheme, the pathway isn't found.

ChristineChichester commented 10 years ago

Daniela and I looked into this a little and found that querying the WP endpoint with the gene PTK2B (http://identifiers.org/ncbigene/2185) gave a list of pathways (list at end of message) that DID NOT include WP49, although the gene is found in the pathway. Similarly, querying the WP SPARQL endpoint with JAK3 (http://identifiers.org/ncbigene/3718) returns no pathways, although this gene is found in WP49 (IL-2 signaling pathway). @Andra can you investigate this a little in the WP RDF?

List of pathways for PTK2B:

http://rdf.wikipathways.org/Pathway/WP794_r67074 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP978_r67181 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP2263_r67609 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP973_r67373 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP855_r67372 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1025_r67068 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1091_r67366 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1144_r67069 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP908_r67075 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP747_r67371 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP1096_r68551 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP313_r69027 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP313_r69027/group/c1e29 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP860_r69215 "PTK2B"@en
http://rdf.wikipathways.org/Pathway/WP752_r68850 "PTK2B"@en

danidi commented 10 years ago

I fear that having a backpagehead value different from the protein name does not (fully) explain this issue. I tried with 6772 (Stat1), but I can't find the IL-2 signaling pathway with this either (tested with the OPS API). It rather seems that a whole pathway can't be found via any of its targets, while the same targets do find other pathways.

Here is the list I got from the initial support ticket (Gene IDs with expected pathways). Maybe the pathways have something in common?

41 DNA_Replication

6850 B_Cell_Receptor_Signaling_Pathway IL-2_Signaling_pathway IL-3_Signaling_Pathway IL-5_signaling_pathway RANKL/RANK_Signaling_Pathway Regulation_of_toll-like_receptor_signaling_pathway

3718 IL-2_Signaling_pathway IL-4_signaling_pathway IL-7_signaling_pathway IL-9_signaling_pathway

695 B_Cell_Receptor_Signaling_Pathway IL-5_signaling_pathway Kit_receptor_signaling_pathway Regulation_of_toll-like_receptor_signaling_pathway

10498 Androgen_receptor_signaling_pathway

11035 TNF_alpha_Signaling_Pathway

10987 TGF_beta_Signaling_Pathway

1457 TNF_alpha_Signaling_Pathway

7329 Androgen_receptor_signaling_pathway TGF_beta_Signaling_Pathway

92 B_Cell_Receptor_Signaling_Pathway

23028 Androgen_receptor_signaling_pathway

4353 Folate_Metabolism Selenium_Pathway Vitamin_B12_Metabolism

1509 Prolactin_Signaling_Pathway

23410 Energy_Metabolism

danidi commented 10 years ago

From the OPS API.

andrawaag commented 10 years ago

Hi, I can't reproduce the issue on the RDF side. First I used the following SPARQL query to verify that http://identifiers.org/ncbigene/2185 is part of WP49. It is.

SELECT DISTINCT ?gpIdentifier WHERE { ?pathway ?p <http://identifiers.org/wikipathways/WP49> . ?gp dcterms:isPartOf ?pathway . ?gp a gpml:DataNode . ?gp dc:identifier ?gpIdentifier }

The other way around, asking for pathways that contain http://identifiers.org/ncbigene/2185, also returns WP49.

SELECT DISTINCT ?ioWP WHERE { ?pathway dc:identifier ?ioWP . ?gp dcterms:isPartOf ?pathway . ?gp a gpml:DataNode . ?gp dc:identifier <http://identifiers.org/ncbigene/2185> }

Both queries can be executed on http://sparql.wikipathways.org

Doing the same for "http://identifiers.org/ncbigene/3718" does return the expected results.

I am stuck here because the IMS isn't relevant since http://identifiers.org/ncbigene/2185 is the original identifier in the pathway. There is no need for mappings.

@ChristineChichester could you send me the queries you used?

pgroth commented 10 years ago

It could be that the RDF is not up-to-date in the platform.

Paul


Dr. Paul Groth (p.t.groth@vu.nl) http://www.few.vu.nl/~pgroth/ Assistant Professor

danidi commented 10 years ago

But if it is a version problem (e.g. the protein/pathway connection is not in the OPS RDF), then I wouldn't expect to find the target via the pathway either. In that direction, however, the connection can be retrieved.

andrawaag commented 10 years ago

I just had an offline discussion with Christine on this. I ran into a similar problem when running a federated query between WikiPathways, UniProt and DisGeNET. Some links simply weren't made, although I was pretty sure the data was in.

It appears that there is an issue with the namespace for Entrez Gene. There are actually two, and they are not mapped to each other (see: https://beta.openphacts.org/1.4/mapUri?app_id=18983b12&app_key=c99cf43da48a1a2f9069651fe6be7c06&Uri=http%3A%2F%2Fidentifiers.org%2Fncbigene%2F2185). Both http://identifiers.org/ncbigene and http://identifiers.org/entrez.gene resolve to an entry in Entrez Gene. The namespace was updated somewhere around 2012, deprecating entrez.gene in favour of ncbigene.

In the WikiPathways RDF both namespaces are used. There are two predicates for an identifier: dc:identifier and wp:bdbEntrezGene. dc:identifier points to the original identifier added by the pathway curator; wp:bdbEntrezGene is a normalised identifier (mapped through BridgeDb). Since BridgeDb still uses the older (MIRIAM-based) namespace, we ended up with the entrez.gene-based URI.

To know whether this also causes the observed difference in recall, we need to know which predicate the API calls use to capture the URI of a concept.

I am working on a quick fix for the purpose of my federated query, but I am wondering if the more consistent solution would be to map http://identifiers.org/entrez.gene to http://identifiers.org/ncbigene in the IMS. In the end entrez.gene remains a resolvable namespace.
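A minimal sketch of the mapping idea, assuming (as this thread suggests) that the two namespaces differ only in their prefix and share the same IDs. The helper names `normalise_gene_uri` and `same_gene` are hypothetical and are not part of the IMS or BridgeDb.

```python
# Hypothetical helpers, not part of the IMS: rewrite the deprecated
# entrez.gene prefix to ncbigene so both forms compare equal.
DEPRECATED_PREFIX = "http://identifiers.org/entrez.gene/"
CURRENT_PREFIX = "http://identifiers.org/ncbigene/"


def normalise_gene_uri(uri: str) -> str:
    """Return the ncbigene form of an Entrez Gene URI; leave other URIs untouched."""
    if uri.startswith(DEPRECATED_PREFIX):
        return CURRENT_PREFIX + uri[len(DEPRECATED_PREFIX):]
    return uri


def same_gene(uri_a: str, uri_b: str) -> bool:
    """Compare two URIs after normalising the namespace."""
    return normalise_gene_uri(uri_a) == normalise_gene_uri(uri_b)
```

A client-side rewrite like this only works around the problem for one consumer; mapping the namespaces in the IMS would fix it for every API call.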

Chris-Evelo commented 10 years ago

Yes, agreed. I think the two should be mapped to each other. Is there a way to do that which benefits from the fact that, while the resource names are different, it is actually the same resource and the IDs are actually the same?

andrawaag commented 10 years ago

For the sake of completeness: I talked to Egon, and the issue with BridgeDb appears to be fixed already. That means that BridgeDb already contains the ncbigene namespace in its datasource table, so in our next releases the URIs in WPRDF will be consistent.

ChristineChichester commented 10 years ago

I'll add Christian to the ticket so we can get the http://identifiers.org/entrez.gene namespace added to the IMS

Christian-B commented 10 years ago

It appears that http://identifiers.org/entrez.gene/$id and http://identifiers.org/ncbigene/$id are actually the same thing, as http://identifiers.org/entrez.gene/100010 forwards to http://identifiers.org/ncbigene/100010.

That being the case, then yes, Christine's fix is the correct one, and it is also VERY easy to apply to older versions of the IMS.

Christian-B commented 10 years ago

Fixed on http://openphacts.cs.man.ac.uk:9090/QueryExpander. The WAR is at https://github.com/openphacts/deployment/blob/master/IMSandExpander/Ops1.4.1/QueryExpander.war

danidi commented 10 years ago

Is this fix in use by the current API 1.4 version? If yes, it didn't solve the problem unfortunately.

Christian-B commented 10 years ago

It looks like the fix was done incorrectly. See http://openphacts.cs.man.ac.uk:9090/QueryExpander/dataSource/L and notice the incorrect pattern http://info.identifiers.org/entrez.gene/%24id/$id
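A pattern like that could be caught before deployment. This is a minimal sketch, assuming `$id` is the placeholder convention (as in the URIs quoted in this thread); `pattern_problems` is a hypothetical helper and not part of the QueryExpander.

```python
PLACEHOLDER = "$id"


def pattern_problems(pattern: str) -> list:
    """Return a list of human-readable problems found in a URI pattern."""
    problems = []
    # A usable pattern should contain the placeholder exactly once.
    if pattern.count(PLACEHOLDER) != 1:
        problems.append("expected exactly one $id placeholder")
    # '%24' is a percent-encoded '$', which suggests the placeholder
    # was URL-encoded somewhere along the way.
    if "%24" in pattern:
        problems.append("contains a percent-encoded '$' (%24)")
    return problems
```

An empty list means the pattern passed both checks; it does not prove the pattern resolves to anything.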

Christian-B commented 10 years ago

I pushed a bug fix to https://github.com/openphacts/deployment/tree/master/IMSandExpander/Ops1.4.1.1

It is the exact same WAR, with just a different config file inside.

danidi commented 10 years ago

This is still an open issue for 1.4.

egonw commented 9 years ago

I can confirm the latest WPRDF only uses the http://identifiers.org/ncbigene/100010 pattern.

antonisloizou commented 9 years ago

Fixed. The Pathway by Target calls were only looking for wp:GeneProduct; wp:Protein has now been added.
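The shape of that fix can be sketched as a query that accepts more than one node type. The actual API query is not shown in this thread, so the code below is an assumption built from the predicates used in the SPARQL queries earlier in the ticket; the function names are hypothetical and prefix declarations are left out.

```python
# Node types accepted as targets; wp:Protein is the type added by the fix.
TARGET_NODE_TYPES = ("wp:GeneProduct", "wp:Protein")


def type_values_clause(variable="?type", types=TARGET_NODE_TYPES):
    """Build a SPARQL 1.1 VALUES clause restricting a variable to the given types."""
    return "VALUES {} {{ {} }}".format(variable, " ".join(types))


def pathways_by_target_query(target_uri, types=TARGET_NODE_TYPES):
    """Assemble an illustrative pathways-by-target query (prefixes omitted)."""
    return "\n".join([
        "SELECT DISTINCT ?pathway WHERE {",
        "  ?node dcterms:isPartOf ?pathway .",
        "  ?node a ?type .",
        "  " + type_values_clause("?type", types),
        "  ?node dc:identifier <{}> .".format(target_uri),
        "}",
    ])


print(pathways_by_target_query("http://identifiers.org/ncbigene/2185"))
```

Keeping the type list in one place means a further type can be added without touching the query template.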

ChristineChichester commented 9 years ago

I also tested and it works.

egonw commented 9 years ago

I just checked... there is also wp:Complex, which is like the difference between Protein and Target in ChEMBL, but only very few such nodes in WikiPathways are linked to a protein/gene identifier (basically N=1).

Chris-Evelo commented 9 years ago

Yes, that is one of those things that can happen on a community-edited site like WikiPathways. The best way to solve things like that is to just go to WikiPathways and fix them there. Complexes normally should not have (single) protein IDs.

Basically we have three DataNode types that are supposed to carry target information: gene product, RNA and protein. I understand protein has now been added to the API calls. For some reason RNA is not in the RDF; we are checking why that is the case. Possibly that data type was just never used in a pathway in the curated collection.

This brings up another issue. Basically the RDF converter should be built from the data model, not from the actual GPML, because some things might not be used now but could be used later. The same is actually true for generating API calls.
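A minimal sketch of that model-first idea, assuming the three target-bearing types named above. The labels in `MODEL_TARGET_TYPES` and both function names are illustrative only and are not taken from the GPML schema or the converter.

```python
# Assumed data-model listing, based on the three types named in this thread.
MODEL_TARGET_TYPES = frozenset({"GeneProduct", "Rna", "Protein"})


def unused_model_types(observed_types):
    """Types the model allows that never occur in the current data."""
    return sorted(MODEL_TARGET_TYPES - set(observed_types))


def unknown_observed_types(observed_types):
    """Types present in the data that the model listing does not cover."""
    return sorted(set(observed_types) - MODEL_TARGET_TYPES)
```

Driving the converter and the API calls from the model listing means a type such as RNA is handled as soon as a curator starts using it, and comparing the two sets shows where data and model have drifted apart.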