Closed lotharwissler closed 9 years ago
I agree, it's confusing and needs to be resolved. There is some work going on as part of #198. Maybe @ChristineChichester can comment (@StefanSenger).
Hi! Since the latest update of the chemistry search a few days ago (to fix another issue), the SMILES to URL call returns the Chemspider ID of the compound (which is then merged into the ops URL). They are working to fix this bug; in principle both calls should return the same result (with MatchType=0).
As it seems that this will only be possible with the next release of the chemistry registration service, we are left with two options @antonisloizou mentioned here https://github.com/openphacts/GLOBAL/issues/198:
The 3 calls 'Chemical Structure Conversion: Inchi to URL', 'Chemical Structure Conversion: InchiKey to URL', and 'Chemical Structure Conversion: SMILES to URL' are updated so that the chemspider prefix is returned rather than the OCRS one; i.e. return http://rdf.chemspider.com/84990 (which is Erythrose) instead of http://ops.rsc.org/OPS84990 (which is not Erythrose)
OR
The 3 calls 'Chemical Structure Conversion: Inchi to URL', 'Chemical Structure Conversion: InchiKey to URL', and 'Chemical Structure Conversion: SMILES to URL' need to be removed from our API as they currently produce wrong URIs
I'm favouring the option to return chemspider uris rather than to remove the calls. @lotharwissler would it be a problem for the CBN if the call returns a chemspider URI instead?
Ken said the next release would be mid-November.
Is it much effort to change back to using the chemspider URI and then change again once RSC has fixed the issue? Would this change affect the apps? I assume removing the calls is not a realistic option, but leaving it broken until November is not great either.
I guess this should be on the agenda for the TTF on 23/10. As has been pointed out, we have to do something about these 3 methods, so I would want to review at the TTF whether returning CSIDs will solve the problem for now. If we're agreed on that, I think we should put a message out to all of the users that we intend to change the API and see what response we get.
@danidi Yes, in principle any type of URI is fine for us. I'd just like to see that the returned URI is part of the OPS dataset, i.e. returns results for compound info and pharmacology.
As discussed in telcon 2014-10-23, this seems to be caused by OPS identifiers changing in http://ops.rsc.org/. The API for structure/ will have cached in the RDF store the previous response - which now gives either a 404, or in some cases a totally different molecule.
The caches in the RDF store seem not to time out. Is that correct?
I have added a patch to openphacts/OPS_LinkedDataAPI that changes the hashing of the cached graph from crc32 to sha256 - this removes the practical risk of cache-key collisions. (With crc32, the birthday bound gives roughly a 25% chance of at least one collision after 50,000 requests, i.e. about a 75% chance of none.)
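As a rough check on the collision risk, here is a minimal sketch of the standard birthday approximation. It assumes cache keys are spread uniformly over the hash space, and the function name is purely illustrative. Under that assumption, 50,000 keys in a 32-bit space give about a 25% chance of at least one collision (about 75% of none), while the risk for a 256-bit hash is negligible.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate probability that at least two of num_keys uniformly
    distributed hash values collide in a space of 2**hash_bits values."""
    space = 2 ** hash_bits
    pairs = num_keys * (num_keys - 1) / 2
    # Birthday approximation: P(no collision) ~= exp(-pairs / space),
    # so P(at least one collision) ~= 1 - exp(-pairs / space).
    return -math.expm1(-pairs / space)


crc32_risk = collision_probability(50_000, 32)
sha256_risk = collision_probability(50_000, 256)
print(f"crc32:  {crc32_risk:.3f}")
print(f"sha256: {sha256_risk:.3e}")
```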
I do not, however, believe that this particular issue was caused by such a collision, because then you would expect the cached graph to have a totally different OPS identifier AND a different SMILES string. Checking with SPARQL against http://ops2.few.vu.nl:8890/sparql/ does, however, reveal that the "wrong" value is indeed stored in the cache, so it would have been what ops.rsc.org replied at some point earlier.
(No metadata is stored for these cache graphs, so I can't tell you when)
I think the issue with changing identifiers was only discussed because it could appear when updating to the next ops.rsc version. The identifiers are currently still the same (the exact search, for example, still returns the right ones). The ops.rsc API now returns a different type of identifier (the Chemspider one) from the InChI to URI method, and this number is incorrectly appended to http://ops.rsc.org/OPS. This worked fine on 1.4 until at least the 7th of October. Compound information returns a 404 because no information is available for this incorrectly generated URI. The SMILES to URI call shouldn't give a 404 (unless you query for a compound which is not available), but it will always return the wrong compound (although the number would be the right identifier if you put the Chemspider URI prefix in front of it). I'm not familiar with the caching at all, but I don't think it causes the issue here.
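To make the mix-up concrete, here is a small sketch of how the same number ends up in two different URI schemes. The two base URIs are the ones quoted in this thread; the function names are hypothetical and not part of any Open PHACTS code.

```python
# Base URIs as quoted in this thread.
CHEMSPIDER_BASE = "http://rdf.chemspider.com/"
OPS_BASE = "http://ops.rsc.org/OPS"


def buggy_ops_uri(csid: int) -> str:
    """What the broken call effectively does: treat a Chemspider ID
    as if it were an OPS ID and append it to the OPS prefix."""
    return f"{OPS_BASE}{csid}"


def chemspider_uri(csid: int) -> str:
    """The proposed workaround: return the Chemspider ID under the
    Chemspider prefix instead."""
    return f"{CHEMSPIDER_BASE}{csid}"


erythrose_csid = 84990  # Chemspider ID for Erythrose, per this thread
wrong = buggy_ops_uri(erythrose_csid)
right = chemspider_uri(erythrose_csid)
print(wrong)  # points at some other compound, or at nothing
print(right)
```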
No, the caching is not at fault.
The difference is as you say - the /structure service effectively calls http://ops.rsc.org/api/v1/JSON.ashx?op=ConvertTo&convertOptions.Direction=Smiles2InChi&convertOptions.Text=CC%28=O%29Oc1ccccc1C%28=O%29O when called with the Aspirin SMILES CC(=O)Oc1ccccc1C(=O)O - the result looks sane to my untrained eye.
The subsequent call http://ops.rsc.org/api/v1/JSON.ashx?op=ConvertTo&convertOptions.Direction=InChi2CSID&convertOptions.Text=InChI=1S/C9H8O4/c1-6%2810%2913-8-5-3-2-4-7%288%299%2811%2912/h2-5H,1H3,%28H,11,12%29 returns the chemspider ID 2157, not the ops.rsc.org ID 403534. I assume at one point the two identifier schemes were the same - in fact the Explorer has called this "Chemspider ID" - even though it no longer is. It is just an "OPS RSC ID".
Now this change under the hood worries me - how can we trust that the OPS id 403534 will be 403534 in the future and not 123877?
Assuming we want to keep using the OPS IDs, and as long as there is no InChi2OPSID convert option at ops.rsc.org, I guess the only fix is for the structure web service to be changed to use the exact structure match instead - as that returns the OPS identifier rather than the Chemspider ID. Alternatively - is there a URI base for Chemspider IDs we could reliably return instead?
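The two-step conversion described here can be sketched as plain URL construction. The endpoint and parameter names are copied from the URLs quoted in this comment; the helper name is hypothetical, and nothing is sent over the network. Note that this sketch percent-encodes `=`, `/` and `,` as well, which should be equivalent to the partially encoded URLs above.

```python
from urllib.parse import quote, urlencode

# Endpoint as quoted in this thread.
OPS_API = "http://ops.rsc.org/api/v1/JSON.ashx"


def convert_url(direction: str, text: str) -> str:
    """Build a ConvertTo request URL for the given direction and input text."""
    params = {
        "op": "ConvertTo",
        "convertOptions.Direction": direction,
        "convertOptions.Text": text,
    }
    # quote_via=quote percent-encodes reserved characters (no '+' for spaces).
    return OPS_API + "?" + urlencode(params, quote_via=quote)


# Step 1: SMILES -> InChI
aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"
step1 = convert_url("Smiles2InChi", aspirin_smiles)

# Step 2: InChI -> Chemspider ID (this is where a CSID, not an OPS ID, comes back)
aspirin_inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
step2 = convert_url("InChi2CSID", aspirin_inchi)

print(step1)
print(step2)
```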
Hello,
There is also the problem that the explorer still lists these IDs as coming from Chemspider; see http://explorer2.openphacts.org/compounds?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2Fdd758846-1dac-4f0d-a329-06af9a7fa413 for Aspirin. The UI reports Chemspider ID: http://ops.rsc.org/OPS40353. Obviously it's not chemspider any more, since that is still http://www.chemspider.com/Chemical-Structure.2157.html for Aspirin. So what exactly is this OPS ID? Is it just the row ID in the database? Or is it generated somehow? Can we guarantee that OPS40353 will still be OPS40353 in the future? If someone publishes a paper using the Open PHACTS data they need that assurance.
Cheers,
Ian
Ian Dunlop myGrid Team School of Computer Science University of Manchester
I think what we can guarantee is that OPS40353 will not be re-used. However, OPS40353 is generated from a structure+processing_parameters. So what we're trying to work out is what happens when the same structure+slightly_different_processing_parameters are passed. This may result in a different InChI, in which case a new OPS ID will be minted. But the relationship of "the old" ID to this new one needs to be clarified.
OK.. what if we just change the /structure SMILES-to-URL webservice to return the truthful URI for chemspider then? That is after all what the underlying ops.rsc.org service is meant to return. E.g. http://www.chemspider.com/2157
This works fine with, say, /compound, which can give you the "new" OPS identifier http://ops.rsc.org/OPS403534 as well.
{
  "format": "linked-data-api",
  "version": "1.4",
  "result": {
    "_about": "https://beta.openphacts.org/1.4/compound?uri=http%3A%2F%2Fwww.chemspider.com%2F2157&app_id=161aeb7d&app_key=cffc292726627ffc50ece1dccd15aeaf",
    "definition": "https://beta.openphacts.org/api-config",
    "extendedMetadataVersion": "https://beta.openphacts.org/1.4/compound?uri=http%3A%2F%2Fwww.chemspider.com%2F2157&app_id=161aeb7d&app_key=cffc292726627ffc50ece1dccd15aeaf&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite",
    "primaryTopic": {
      "_about": "http://www.chemspider.com/2157",
      "exactMatch": [
        {
          "_about": "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL25",
          "mw_freebase": 180.16,
          "inDataset": "http://www.ebi.ac.uk/chembl",
          "type": "http://rdf.ebi.ac.uk/terms/chembl#SmallMolecule"
        },
        {
          "_about": "http://www.conceptwiki.org/concept/dd758846-1dac-4f0d-a329-06af9a7fa413",
          "inDataset": "http://www.conceptwiki.org",
          "prefLabel_en": "Aspirin",
          "prefLabel": "Aspirin"
        },
        {
          "_about": "http://ops.rsc.org/OPS403534",
          "inDataset": "http://ops.rsc.org",
          "hba": 4,
          "hbd": 1,
          "inchi": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
          "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
          "logp": 1.19,
          "molformula": "C9H8O4",
          "molweight": 180.157,
          "psa": 63.6,
          "ro5_violations": 0,
          "rtb": 3,
          "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"
        },
        {
          "_about": "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00945",
          "inDataset": "http://linkedlifedata.com/resource/drugbank",
          "biotransformation": "Aspirin is rapidly hydrolyzed primarily in the liver to salicylic acid, which is conjugated with glycine (forming salicyluric acid) and glucuronic acid and excreted largely in the urine.",
          "description": "The prototypical analgesic used in the treatment of mild to moderate pain. It has anti-inflammatory and antipyretic properties and acts as an inhibitor of cyclooxygenase which results in the inhibition of the biosynthesis of prostaglandins. Aspirin also inhibits platelet aggregation and is used in the prevention of arterial and venous thrombosis. (From Martindale, The Extra Pharmacopoeia, 30th ed, p5)",
          "drugType": [
            "smallMolecule",
            "approved"
          ],
          "genericName": "Aspirin",
          "meltingPoint": "135 oC (boiling point 140 oC)",
          "proteinBinding": "High (99.5%) to albumin. Decreases as plasma salicylate concentration increases, with reduced plasma albumin concentration or renal dysfunction, and during pregnancy.",
          "toxicity": "Oral, mouse: LD50 = 250 mg/kg; Oral, rabbit: LD50 = 1010 mg/kg; Oral, rat: LD50 = 200 mg/kg. Effects of overdose include: tinnitus, abdominal pain, hypokalemia, hypoglycemia, pyrexia, hyperventilation, dysrhythmia, hypotension, hallucination, renal failure, confusion, seizure, coma, and death."
        }
      ],
      "isPrimaryTopicOf": "https://beta.openphacts.org/1.4/compound?uri=http%3A%2F%2Fwww.chemspider.com%2F2157&app_id=161aeb7d&app_key=cffc292726627ffc50ece1dccd15aeaf"
    }
  }
}
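For clients that start from a Chemspider URI, the OPS identifier can be read out of a response like the one above by looking for the exactMatch entry whose inDataset is http://ops.rsc.org. The following is only a sketch on a trimmed copy of that response. The function name is hypothetical, and the handling of a single exactMatch object (rather than a list) is an assumption about how the API might serialise a lone match.

```python
import json
from typing import Optional

OPS_DATASET = "http://ops.rsc.org"

# Trimmed copy of the /compound response shown above (most properties omitted).
response_text = """
{
  "result": {
    "primaryTopic": {
      "_about": "http://www.chemspider.com/2157",
      "exactMatch": [
        {"_about": "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL25",
         "inDataset": "http://www.ebi.ac.uk/chembl"},
        {"_about": "http://ops.rsc.org/OPS403534",
         "inDataset": "http://ops.rsc.org"}
      ]
    }
  }
}
"""


def ops_uri(response: dict) -> Optional[str]:
    """Return the URI of the exactMatch entry from the OPS dataset, if any."""
    matches = response["result"]["primaryTopic"].get("exactMatch", [])
    # Assumption: a lone match may arrive as a single object instead of a list.
    if isinstance(matches, dict):
        matches = [matches]
    for match in matches:
        if isinstance(match, dict) and match.get("inDataset") == OPS_DATASET:
            return match["_about"]
    return None


found = ops_uri(json.loads(response_text))
print(found)
```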
I am in the process of discussing with the RSC whether the original behaviour can be restored; otherwise this is the route we will need to take.
Update - RSC are testing a fix as we speak.
The test instance of the search now correctly retrieves OPS1576568 (this is not yet available from the Open PHACTS API).
Hi Lothar! The issue should be solved now on 1.4. Can you please retest?
I can confirm that the issue is solved for all the cases I tested.
Hi there,
I've been using SMILES to URL to find OPS compounds for SMILES given by a user. Now I have noticed that this method can fail to find a compound which really should be found (because it is a CHEMBL SMILES).
SMILES to URL returns http://ops.rsc.org/OPS8019313 for which compound_info returns 404
Exact Match with MatchType=0 returns http://ops.rsc.org/OPS1576568 for which I do get a compound_info (but no reference to OPS8019313)
I suspect that I do not get a compound_info because OPS8019313 is a virtual compound (known bug?). Is SMILES to URL thus a bad choice for my user workflow, so that I should replace it with Exact Match? Why do the two methods return different results, and when is SMILES to URL reliably applicable?
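One way to make such a workflow robust against this kind of mismatch is to try SMILES to URL first and fall back to Exact Match when the returned URI has no compound info. The sketch below only shows the control flow: the three callables are hypothetical stand-ins for the real API calls, and the stubs simply replay the behaviour reported in this post.

```python
from typing import Callable, Optional


def resolve_compound(
    smiles: str,
    smiles_to_url: Callable[[str], Optional[str]],
    exact_match: Callable[[str], Optional[str]],
    has_compound_info: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate URI for which compound info exists."""
    for lookup in (smiles_to_url, exact_match):
        uri = lookup(smiles)
        if uri is not None and has_compound_info(uri):
            return uri
    return None


# Stubs replaying the reported behaviour (hypothetical, no network access).
def stub_smiles_to_url(smiles: str) -> Optional[str]:
    return "http://ops.rsc.org/OPS8019313"


def stub_exact_match(smiles: str) -> Optional[str]:
    return "http://ops.rsc.org/OPS1576568"


def stub_has_compound_info(uri: str) -> bool:
    # compound_info returned 404 for OPS8019313 in the report above.
    return uri != "http://ops.rsc.org/OPS8019313"


resolved = resolve_compound(
    "<user-supplied SMILES>",  # placeholder; the stubs ignore the input
    stub_smiles_to_url,
    stub_exact_match,
    stub_has_compound_info,
)
print(resolved)
```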