Open stain opened 9 years ago
I think we need:
Perhaps @agaulton can check the URI pattern for identifiers that DON'T match MeSH or HGNC, to make sure they don't accidentally get mapped by the IMS, e.g. that they don't match the MeSH or HGNC regular expressions:
^[A-Za-z0-9]+$
^[A-Za-z-0-9_]+(\@)?$
On the mesh identifiers.org URIs:
Anyway, this probably doesn't matter too much if we're only using it as an ID.
In the SureChEMBL dataset:
MeSH disease IDs match the pattern: ^D0[0-9]{5}$
MeSH supplementary concept terms match the pattern: ^C5[0-9]{5}$
Custom SciBite disease IDs match the pattern: ^DX[0-9]{5}$
So they could be distinguished, but not by the identifiers.org pattern.
On the gene symbols, we have ~10 of these that are not genuine HGNC symbols. These have the form: ^_[A-Za-z0-9]+$
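To make the distinction concrete, here is a minimal Python sketch that applies the patterns quoted above. The function names, the returned labels, and the sample IDs `DX00001` and `_ABC1` are made up for illustration and do not come from the dataset.

```python
import re

# Patterns as quoted in this comment for the SureChEMBL dataset.
MESH_DISEASE = re.compile(r"^D0[0-9]{5}$")
MESH_SUPPLEMENTARY = re.compile(r"^C5[0-9]{5}$")
SCIBITE_DISEASE = re.compile(r"^DX[0-9]{5}$")
NON_HGNC_SYMBOL = re.compile(r"^_[A-Za-z0-9]+$")


def classify_disease_id(identifier: str) -> str:
    """Return an illustrative label for a SureChEMBL disease ID."""
    # fullmatch avoids accepting a trailing newline, which "$" alone would allow.
    if MESH_DISEASE.fullmatch(identifier):
        return "mesh_disease"
    if MESH_SUPPLEMENTARY.fullmatch(identifier):
        return "mesh_supplementary"
    if SCIBITE_DISEASE.fullmatch(identifier):
        return "scibite_custom"
    return "unknown"


def has_non_hgnc_form(symbol: str) -> bool:
    """True if the symbol has the leading-underscore form of the non-HGNC entries."""
    return NON_HGNC_SYMBOL.fullmatch(symbol) is not None


if __name__ == "__main__":
    # D009765 and FDFT1 appear in this thread; DX00001 and _ABC1 are hypothetical.
    for disease_id in ("D009765", "DX00001"):
        print(disease_id, classify_disease_id(disease_id))
    for symbol in ("FDFT1", "_ABC1"):
        print(symbol, has_non_hgnc_form(symbol))
```

A check like this would have to run before the IDs are turned into identifiers.org URIs, since the identifiers.org patterns quoted earlier accept all of these forms.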
On the unichem chembl-surechembl linkset, we already have one as part of the chembl_20 release that should work: ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/20.1/chembl_20.1_unichem.ttl.gz
SureChEMBL URIs are different (since this pre-dated the SureChEMBL RDF), so the pattern would need mapping in the IMS.
Correction - this unichem data is not currently a link set, but could be converted...
Update - Nick has fixed the issue with identifiers.org, so the mesh links now work again. He has also made the regex less permissive: ^(C|D)0\d{5}$. So it should be possible to distinguish the SciBite IDs now.
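As a quick check of that claim, a small Python sketch using the regex exactly as quoted; the helper name and the sample IDs other than D009765 are hypothetical.

```python
import re

# Regex as quoted in the update above for the identifiers.org MeSH pattern.
IDENTIFIERS_ORG_MESH = re.compile(r"^(C|D)0\d{5}$")


def matches_identifiers_org_mesh(identifier: str) -> bool:
    """True if the identifier is accepted by the quoted MeSH pattern."""
    return IDENTIFIERS_ORG_MESH.fullmatch(identifier) is not None


if __name__ == "__main__":
    # D009765 is from this thread; the other IDs are made-up samples.
    for candidate in ("D009765", "C012345", "DX00001", "C512345"):
        print(candidate, matches_identifiers_org_mesh(candidate))
```

Under this pattern a DX-prefixed SciBite ID no longer matches. Note that an ID with the C5 prefix mentioned earlier for supplementary concept terms would not match either; whether that is intended is not stated here.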
Compound mappings to OCRS are now available, e.g. for http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064. Disease and Gene_ID patterns seem to be missing still.
From https://wiki.openphacts.org/index.php/SureChEMBL
http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064
Please note that SureChEMBL molecules are not yet loaded in the Open PHACTS chemical registry, so cannot currently be retrieved via OCRS IDs.
Targets are identified by HGNC symbols with URIs of the form:
http://rdf.ebi.ac.uk/resource/surechembl/target/FDFT1
Mappings from HGNC symbols to other gene/protein identifiers are available via the IMS through Ensembl linksets.
Diseases are identified by MeSH disease identifiers with URIs of the form:
http://rdf.ebi.ac.uk/resource/surechembl/indication/D009765
Mappings to UMLS and Disease Ontology (DO) are available via DisGeNET link sets in the IMS. It should be noted that not all MeSH identifiers currently map to a disease in DO.
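For anyone consuming these URIs, a rough Python sketch of splitting a SureChEMBL URI into its entity type and local ID. It assumes that every URI follows the host and path layout of the examples quoted here; the function name is hypothetical.

```python
from urllib.parse import urlparse

# Host and path prefix taken from the example URIs in this thread.
SURECHEMBL_HOST = "rdf.ebi.ac.uk"
SURECHEMBL_PREFIX = "/resource/surechembl/"


def split_surechembl_uri(uri: str) -> tuple[str, str]:
    """Return (entity_type, local_id), e.g. ("target", "FDFT1")."""
    parsed = urlparse(uri)
    if parsed.netloc != SURECHEMBL_HOST or not parsed.path.startswith(SURECHEMBL_PREFIX):
        raise ValueError(f"not a SureChEMBL resource URI: {uri}")
    parts = parsed.path[len(SURECHEMBL_PREFIX):].split("/")
    # Expect exactly two non-empty segments: the entity type and the local ID.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"unexpected SureChEMBL URI layout: {uri}")
    return parts[0], parts[1]


if __name__ == "__main__":
    for uri in (
        "http://rdf.ebi.ac.uk/resource/surechembl/molecule/SCHEMBL15064",
        "http://rdf.ebi.ac.uk/resource/surechembl/target/FDFT1",
        "http://rdf.ebi.ac.uk/resource/surechembl/indication/D009765",
    ):
        print(split_surechembl_uri(uri))
```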
Patents are uniquely identified by patent numbers in a defined format. This should be the patent office code (e.g., EP, WO or US) followed by a hyphen, the patent number (no leading zeros), another hyphen and finally the kind code (e.g., A1, B2). The SureChEMBL interface provides a service to standardise and resolve other formats of patent numbers.
URIs take the form:
http://rdf.ebi.ac.uk/resource/surechembl/patent/EP-1339685-A2
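The patent number format described above can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and the exact character classes (a two-letter office code, and a kind code of one letter plus an optional digit) are assumptions based on the examples given, not a specification of what SureChEMBL accepts.

```python
import re

# Base taken from the example patent URI above.
PATENT_BASE = "http://rdf.ebi.ac.uk/resource/surechembl/patent/"

# Assumed shape: office code, number without leading zeros, kind code.
PATENT_ID = re.compile(r"^[A-Z]{2}-[1-9][0-9]*-[A-Z][0-9]?$")


def patent_uri(office: str, number: str, kind: str) -> str:
    """Build a patent URI from its three parts, stripping leading zeros."""
    patent_id = f"{office.upper()}-{number.lstrip('0')}-{kind.upper()}"
    if not PATENT_ID.fullmatch(patent_id):
        raise ValueError(f"unexpected patent number format: {patent_id}")
    return PATENT_BASE + patent_id


if __name__ == "__main__":
    print(patent_uri("EP", "1339685", "A2"))
    print(patent_uri("ep", "0001339685", "a2"))
```

For patent numbers in other formats, the standardisation service of the SureChEMBL interface mentioned above is the better route; this sketch only handles input that is already split into its three parts.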