openphacts / IdentityMappingService

The Identity Mapping Service to combine BridgeDB and the Validator

Can't map from NCBI Uniprot proteins #17

Open stain opened 8 years ago

stain commented 8 years ago

Looking up the IMS mapping for URIs like http://www.ncbi.nlm.nih.gov/protein/P62158 does not return URIs like http://purl.uniprot.org/uniprot/P62158 or its transitive mapping http://bio2rdf.org/drugbank:BE0000418, even though looking up the uniprot or drugbank identifier in reverse works fine and includes the ncbi protein pattern as part of the uniprot mapping.

This happens because the data sources for both Uniprot and NCBI Protein claim to handle the http://www.ncbi.nlm.nih.gov/protein/$id pattern.
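To illustrate the kind of conflict described here, a minimal Python sketch follows. It is not the IMS or BridgeDB code: the `SOURCE_PATTERNS` registry and the `find_conflicts` helper are hypothetical, and the assumption that the Uniprot source also registers the purl.uniprot.org pattern is inferred from the URIs in this issue.

```python
from collections import defaultdict

# Hypothetical registry: data source name -> URI patterns it claims.
# The pattern strings come from this issue; the structure is illustrative only.
SOURCE_PATTERNS = {
    "Uniprot": [
        "http://purl.uniprot.org/uniprot/$id",
        "http://www.ncbi.nlm.nih.gov/protein/$id",
    ],
    "NCBI Protein": [
        "http://www.ncbi.nlm.nih.gov/protein/$id",
    ],
}


def find_conflicts(source_patterns):
    """Return the patterns that are claimed by more than one data source."""
    claimed_by = defaultdict(list)
    for source, patterns in source_patterns.items():
        for pattern in patterns:
            claimed_by[pattern].append(source)
    return {p: s for p, s in claimed_by.items() if len(s) > 1}


conflicts = find_conflicts(SOURCE_PATTERNS)
for pattern, sources in conflicts.items():
    print(pattern, "is claimed by", ", ".join(sources))
```

With two claimants for one pattern, an incoming NCBI URI can only be attributed to one of them, which would explain why the forward lookup misses the Uniprot mappings while the reverse lookup still works.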

We do not currently have any NCBI Protein mappings, so perhaps a workaround is to disable the NCBI Protein source?

danidi commented 8 years ago

Since we haven't exposed the identifiers.org mappings for NCBI yet, I don't think disabling the NCBI source should cause any issues. In general, it seems that NCBI accepts several different kinds of identifiers (not just from uniprot), e.g. http://www.ncbi.nlm.nih.gov/protein/CAA71118.1 from genbank. This might make it more difficult to include this datasource in the IMS.

stain commented 8 years ago

Namespace overlapping, but without CURIEs.. not so nice, NCBI.

So we have no guarantee that other ncbi protein URIs like that do not match the uniprot regular expression, in which case IMS could wrongly map them back to uniprot if they are used as input.
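A rough way to probe this concern is sketched below in Python. The regular expression is the accession pattern as I recall it from UniProt's documentation; the expression IMS actually uses may differ, and `looks_like_uniprot` is a hypothetical helper, not part of IMS.

```python
import re

# UniProt accession pattern as documented by UniProt (assumption: the
# regular expression configured in IMS may not be identical to this one).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}"
)

NCBI_PROTEIN_PREFIX = "http://www.ncbi.nlm.nih.gov/protein/"


def looks_like_uniprot(uri):
    """True if the local part of an NCBI Protein URI is shaped like a UniProt accession."""
    if not uri.startswith(NCBI_PROTEIN_PREFIX):
        return False
    local_id = uri[len(NCBI_PROTEIN_PREFIX):]
    return UNIPROT_ACCESSION.fullmatch(local_id) is not None


for uri in (
    "http://www.ncbi.nlm.nih.gov/protein/P62158",
    "http://www.ncbi.nlm.nih.gov/protein/CAA71118.1",
):
    print(uri, looks_like_uniprot(uri))
```

The two example URIs from this thread come out differently, but that only covers these two identifiers. It does not show that no non-uniprot NCBI identifier can ever match, which is the guarantee we are missing.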

http://www.ncbi.nlm.nih.gov/protein/$id is not used in any of the linksets, so it might also be OK to disable it from the Uniprot datasource, and thus remove it from the results when looking up a uniprot identifier. Would this be a meaningful resolution, or do we want to keep http://www.ncbi.nlm.nih.gov/protein/$id as an outgoing link from /mapURI?

I am checking to ensure this pattern is not used by any of the RDF in the cache - but it could still be a useful URI pattern to support as NCBI services are widely used in bioinformatics.

Chris-Evelo commented 8 years ago

I think the problem here is that NCBI doesn't want to link to UniProt or other external resources directly, but to their own instance of a protein, which they just happen to have taken from UniProt. I think that we should probably not go along with that and should make sure we end up at UniProt, unless somebody explicitly searches for an NCBI link. But that means we should not depend on NCBI to resolve that issue. (Maybe this is what @stain meant?)