openphacts / IdentityMappingService

The Identity Mapping Service to combine BridgeDB and the Validator

Support SureChEMBL external link URI patterns #14

Open stain opened 9 years ago

stain commented 9 years ago

From https://wiki.openphacts.org/index.php/SureChEMBL

Compounds are assigned SureChEMBL identifiers as used in the SureChEMBL interface and download files. Please note these identifiers have no relation to ChEMBL identifiers, but the UniChem system can be used to cross-reference the two. The URIs provided take the following form:

http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064

Please note that SureChEMBL molecules are not yet loaded in the Open PHACTS chemical registry, so cannot currently be retrieved via OCRS IDs.

Targets are identified by HGNC symbols with URIs of the form:

http://rdf.ebi.ac.uk/resource/surechembl/target/FDFT1

Mappings from HGNC symbols to other gene/protein identifiers are available via the IMS through Ensembl linksets.

Diseases are identified by MeSH disease identifiers with URIs of the form:

http://rdf.ebi.ac.uk/resource/surechembl/indication/D009765

Mappings to UMLS and Disease Ontology (DO) are available via DisGeNET link sets in the IMS. It should be noted that not all MeSH identifiers currently map to a disease in DO.

Patents are uniquely identified by patent numbers in a defined format. This should be the patent office code (e.g., EP, WO or US) followed by a hyphen, the patent number (no leading zeros), another hyphen and finally the kind code (e.g., A1, B2). The SureChEMBL interface provides a service to standardise and resolve other formats of patent numbers.

URIs take the form:

http://rdf.ebi.ac.uk/resource/surechembl/patent/EP-1339685-A2
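As a rough illustration of the patent identifier format described above, the following Python sketch builds a patent URI from its three parts. The function name `patent_uri` and the exact regular expression are assumptions made for this example: the wiki text only says "office code, hyphen, number without leading zeros, hyphen, kind code", so the pattern here (two letters, digits, a letter plus an optional digit) may be stricter or looser than what SureChEMBL actually accepts.

```python
import re

PATENT_BASE = "http://rdf.ebi.ac.uk/resource/surechembl/patent/"

# Assumed shape, not taken from SureChEMBL itself:
# two-letter office code, number with no leading zeros,
# kind code of one letter optionally followed by one digit.
PATENT_ID = re.compile(r"[A-Z]{2}-[1-9][0-9]*-[A-Z][0-9]?")


def patent_uri(office, number, kind):
    """Build a SureChEMBL-style patent URI from its three components."""
    # Strip leading zeros, as required by the described format.
    number_str = str(number).lstrip("0")
    patent_id = f"{office.upper()}-{number_str}-{kind.upper()}"
    if PATENT_ID.fullmatch(patent_id) is None:
        raise ValueError(f"not a recognised patent identifier: {patent_id!r}")
    return PATENT_BASE + patent_id
```

With these assumptions, `patent_uri("EP", "01339685", "A2")` yields the example URI quoted above. Patent numbers in other formats would still need the standardisation service mentioned in the wiki text.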

stain commented 9 years ago

I think we need:

stain commented 9 years ago

Perhaps @agaulton can check what the URI pattern is for identifiers that DON'T match MeSH or HGNC, to ensure they don't accidentally get mapped by the IMS, e.g. that they don't match the MeSH or HGNC regular expressions.

agaulton commented 9 years ago

On the MeSH identifiers.org URIs:

Anyway, this probably doesn't matter too much if we're only using it as an ID.

In the SureChEMBL dataset:

MeSH disease IDs match the pattern: ^D0[0-9]{5}$

MeSH supplementary concept terms match the pattern: ^C5[0-9]{5}$

Custom SciBite disease IDs match the pattern: ^DX[0-9]{5}$

So they could be distinguished, but not by the identifiers.org pattern.
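To make the distinction concrete, here is a minimal Python sketch that classifies an indication identifier using the three patterns quoted in this comment. The function name `classify_indication` and the label strings are made up for illustration; only the regular expressions come from the comment.

```python
import re

# Patterns as quoted in this comment; the dictionary keys are
# hypothetical labels chosen for this example.
INDICATION_PATTERNS = {
    "mesh_disease": re.compile(r"^D0[0-9]{5}$"),
    "mesh_supplementary": re.compile(r"^C5[0-9]{5}$"),
    "scibite_custom": re.compile(r"^DX[0-9]{5}$"),
}


def classify_indication(identifier):
    """Return the label of the first matching pattern, or None."""
    for label, pattern in INDICATION_PATTERNS.items():
        # fullmatch rejects a trailing newline, which "$" alone would allow.
        if pattern.fullmatch(identifier):
            return label
    return None
```

A mapping step could then skip anything labelled `scibite_custom` (or `None`) rather than treating it as a MeSH identifier.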

agaulton commented 9 years ago

On the gene symbols, we have ~10 of these that are not genuine HGNC symbols. These have the form: ^_[A-Za-z0-9]+$
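A small Python sketch of how these non-HGNC symbols could be filtered out before mapping, assuming the target URI form given in the issue description. The function name `is_mappable_target` and the symbol `_ABC1` used in the usage note are hypothetical; the real non-HGNC symbols are not listed in this thread.

```python
import re

TARGET_BASE = "http://rdf.ebi.ac.uk/resource/surechembl/target/"

# Pattern for the non-genuine HGNC symbols, as quoted in this comment.
NON_HGNC_SYMBOL = re.compile(r"^_[A-Za-z0-9]+$")


def is_mappable_target(uri):
    """True if the URI is a SureChEMBL target whose symbol is not
    one of the underscore-prefixed, non-HGNC symbols."""
    if not uri.startswith(TARGET_BASE):
        return False
    symbol = uri[len(TARGET_BASE):]
    return bool(symbol) and NON_HGNC_SYMBOL.fullmatch(symbol) is None
```

For example, the `FDFT1` target URI from the issue description passes this check, while a URI ending in a made-up symbol such as `_ABC1` does not.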

agaulton commented 9 years ago

On the UniChem ChEMBL-SureChEMBL linkset, we already have one as part of the chembl_20 release that should work: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/20.1/chembl_20.1_unichem.ttl.gz

The SureChEMBL URIs in that file are different (it pre-dates the SureChEMBL RDF), so the URI pattern would need mapping in the IMS.

agaulton commented 9 years ago

Correction: this UniChem data is not currently a link set, but could be converted.

agaulton commented 9 years ago

Update: Nick has fixed the issue with identifiers.org, so the MeSH links now work again. He has also made the regex less permissive: ^(C|D)0\d{5}$

So it should be possible to distinguish the SciBite IDs now.
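A quick way to check this is to run the updated pattern against one identifier of each kind. The Python sketch below does that; the helper name `matches_mesh` and the sample IDs `DX00001` and `C512345` are made up for the example (only `D009765` appears in this thread), and `re.ASCII` is an assumption so that `\d` is limited to 0-9.

```python
import re

# Updated identifiers.org MeSH pattern as quoted in this comment.
# re.ASCII restricts \d to the digits 0-9 (an assumption about intent).
MESH_PATTERN = re.compile(r"^(C|D)0\d{5}$", re.ASCII)


def matches_mesh(identifier):
    """True if the identifier satisfies the updated MeSH pattern."""
    return MESH_PATTERN.fullmatch(identifier) is not None
```

Under this pattern a MeSH disease ID such as `D009765` matches, while a SciBite-style ID such as `DX00001` does not. Note that, as quoted, the pattern would also reject supplementary concept IDs of the `C5…` form mentioned earlier in the thread, which may or may not be intended.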

danidi commented 8 years ago

Compound mappings to OCRS are now available, e.g. for http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064. The disease and Gene_ID patterns still seem to be missing.