openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Smiles to URL vs. Exact Match and missing compound_info #205

Closed lotharwissler closed 9 years ago

lotharwissler commented 9 years ago

Hi there,

I've been using SMILES to URL to find OPS compounds for SMILES given by a user. Now I have noticed that this method can fail to find a compound which really should be found (because it is a CHEMBL SMILES).

Input SMILES: CCCC(CCC)N1CCN2C(=O)N(c3ccc(OC)cc3)c4nc(C)cc1c24 

SMILES to URL returns http://ops.rsc.org/OPS8019313 for which compound_info returns 404

Exact Match with MatchType=0 returns http://ops.rsc.org/OPS1576568 for which I do get a compound_info (but no reference to OPS8019313)

I suspect that I do not get a compound_info because OPS8019313 is a virtual compound (known bug?). Is SMILES to URL therefore a bad choice for my user workflow, and should I replace it with Exact Match? Why do the two methods return different results, and when is SMILES to URL reliably applicable?
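
One defensive way to handle this in a client workflow is to check that the URI returned by SMILES to URL actually resolves in compound_info, and to fall back to Exact Match when it does not. A minimal sketch, assuming three hypothetical callables that wrap the respective API calls (their names and signatures are illustrative and not part of the Open PHACTS API):

```python
from typing import Callable, Optional


def resolve_compound(
    smiles: str,
    smiles_to_url: Callable[[str], Optional[str]],
    exact_match: Callable[[str], Optional[str]],
    has_compound_info: Callable[[str], bool],
) -> Optional[str]:
    """Return a compound URI that resolves in compound_info, or None.

    All three callables are hypothetical wrappers around the API calls
    discussed in this issue; they are injected so the logic can be
    exercised without network access.
    """
    # First choice: the direct SMILES to URL conversion.
    uri = smiles_to_url(smiles)
    if uri is not None and has_compound_info(uri):
        return uri
    # Fallback: exact structure match (MatchType=0 in the API).
    uri = exact_match(smiles)
    if uri is not None and has_compound_info(uri):
        return uri
    return None
```

With the example above, the first lookup would yield http://ops.rsc.org/OPS8019313, fail the compound_info check, and the fallback would return http://ops.rsc.org/OPS1576568.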

StefanSenger commented 9 years ago

I agree, it's confusing and needs to be resolved. There is some work going on as part of #198. Maybe @ChristineChichester can comment (@StefanSenger).

danidi commented 9 years ago

Hi! Since the latest update of the chemistry search a few days ago (to fix another issue), SMILES to URL returns the ChemSpider ID of the compound, which is then appended to the OPS URL prefix. They are working to fix this bug; in principle both calls should return the same result (with MatchType=0).

danidi commented 9 years ago

As it seems that this will only be possible with the next release of the chemistry registration service, we are left with two options @antonisloizou mentioned here https://github.com/openphacts/GLOBAL/issues/198:

The 3 calls 'Chemical Structure Conversion: Inchi to URL', 'Chemical Structure Conversion: InchiKey to URL', and 'Chemical Structure Conversion: SMILES to URL' are updated so that the chemspider prefix is returned instead of the OCRS one; i.e. return http://rdf.chemspider.com/84990 (which is Erythrose) instead of http://ops.rsc.org/OPS84990 (which is not Erythrose)

OR

The 3 calls 'Chemical Structure Conversion: Inchi to URL', 'Chemical Structure Conversion: InchiKey to URL', and 'Chemical Structure Conversion: SMILES to URL' need to be removed from our API as they currently produce wrong URIs

I'm favouring the option to return chemspider uris rather than to remove the calls. @lotharwissler would it be a problem for the CBN if the call returns a chemspider URI instead?
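
The first option comes down to which URI prefix is combined with the numeric identifier. A minimal sketch, using the two prefixes quoted above; the function names are hypothetical and only illustrate the idea:

```python
# URI prefixes as quoted in this thread.
CHEMSPIDER_PREFIX = "http://rdf.chemspider.com/"
OCRS_PREFIX = "http://ops.rsc.org/OPS"


def chemspider_uri(csid: int) -> str:
    """Build a URI for a ChemSpider ID (CSID)."""
    return f"{CHEMSPIDER_PREFIX}{csid}"


def ocrs_uri(ops_id: int) -> str:
    """Build a URI for an OPS (OCRS) ID. Must not be given a CSID."""
    return f"{OCRS_PREFIX}{ops_id}"


# The current bug is equivalent to calling ocrs_uri() with a CSID:
csid_for_erythrose = 84990
wrong = ocrs_uri(csid_for_erythrose)        # points at a different compound
right = chemspider_uri(csid_for_erythrose)  # option 1 in the list above
```

Keeping the two identifier types in separate helper functions makes it harder to attach a ChemSpider number to the OCRS prefix by accident.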

ChristineChichester commented 9 years ago

Ken said the next release would be mid-November.

Is it much effort to change back to using the chemspider URI and then change again once RSC has fixed the issue? Would this change affect the apps? I assume removing the calls is not a realistic option, but leaving it broken until November is not great either.

leeharland commented 9 years ago

I guess this should be on the agenda for the TTF on 23/10. As has been pointed out, we have to do something about these 3 methods, so I would want to review at the TTF whether returning CSIDs will solve the problem for now. If we're agreed on that, I think we put a message out to all of the users that we intend to change the API and see what response we get.

lotharwissler commented 9 years ago

@danidi Yes, in principle any type of URI is fine for us. I'd just like to see that the returned URI is part of the OPS dataset, i.e. that it returns results for compound info and pharmacology.

stain commented 9 years ago

As discussed in telcon 2014-10-23, this seems to be caused by OPS identifiers changing in http://ops.rsc.org/. The API for structure/ will have cached in the RDF store the previous response - which now gives either a 404, or in some cases a totally different molecule.

The caches in the RDF store seem to not time out.. is that correct?

I have added a patch to openphacts/OPS_LinkedDataAPI that changes the hashing of the cached graph from crc32 to sha256, which removes the practical risk of cache-key collisions. (With crc32, the birthday bound gives roughly a 25% chance of at least one collision after 50,000 requests.)
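
The effect of the wider hash can be estimated with the standard birthday approximation. A minimal sketch, assuming the cache key is derived from the request URL (an assumption for illustration; the actual input to the hash in OPS_LinkedDataAPI is not shown in this thread):

```python
import hashlib
import math
import zlib


def crc32_key(request_url: str) -> str:
    """32-bit cache key: only 2**32 distinct values are possible."""
    return format(zlib.crc32(request_url.encode("utf-8")), "08x")


def sha256_key(request_url: str) -> str:
    """256-bit cache key as a 64-character hex string."""
    return hashlib.sha256(request_url.encode("utf-8")).hexdigest()


def collision_probability(n_keys: int, hash_bits: int) -> float:
    """Birthday approximation for at least one collision among n_keys.

    p = 1 - exp(-n * (n - 1) / (2 * 2**bits)), assuming uniformly
    distributed hash values.
    """
    space = 2.0 ** hash_bits
    return -math.expm1(-n_keys * (n_keys - 1) / (2.0 * space))


if __name__ == "__main__":
    for bits in (32, 256):
        print(bits, collision_probability(50_000, bits))
```

Under this approximation, 50,000 distinct keys in a 32-bit space collide with a probability of roughly one in four, while the 256-bit figure is negligible.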

I do not, however, believe that this particular issue was caused by such a collision, because then you would expect the cached graph to have a totally different OPS identifier AND a different SMILES string. Checking with SPARQL against http://ops2.few.vu.nl:8890/sparql/ does however reveal that the "wrong" value is indeed stored in the cache, so it would have been what ops.rsc.org replied at some point earlier.

(No metadata is stored for these cache graphs, so I can't tell you when)

danidi commented 9 years ago

I think the issue of changing identifiers was only discussed because it could arise when updating to the next ops.rsc version. The identifiers are currently still the same (the exact search, for example, still returns the right ones). What has changed is that the ops.rsc API now returns a different type of identifier (the ChemSpider one) from the InChI to URI method, and this number is incorrectly appended to http://ops.rsc.org/OPS. This worked fine on 1.4 until at least the 7th of October.

The 404 from compound information occurs because no information is available for this incorrectly generated URI. The SMILES to URI call itself shouldn't give a 404 (unless you query for a compound which is not available), but it will always return the wrong compound; the number would be the right identifier if it were prefixed with the ChemSpider URI base instead. I'm not at all familiar with the caching, but I don't think it causes the issue here.

stain commented 9 years ago

No, the caching is not at fault.

The difference is as you say - the /structure service effectively calls http://ops.rsc.org/api/v1/JSON.ashx?op=ConvertTo&convertOptions.Direction=Smiles2InChi&convertOptions.Text=CC%28=O%29Oc1ccccc1C%28=O%29O when called with the Aspirin SMILES CC(=O)Oc1ccccc1C(=O)O - the result looks sane to my untrained eye.

The subsequent call http://ops.rsc.org/api/v1/JSON.ashx?op=ConvertTo&convertOptions.Direction=InChi2CSID&convertOptions.Text=InChI=1S/C9H8O4/c1-6%2810%2913-8-5-3-2-4-7%288%299%2811%2912/h2-5H,1H3,%28H,11,12%29 returns the ChemSpider ID 2157, not the ops.rsc.org ID 403534. I assume at one point the two identifier schemes were the same; in fact the Explorer has called this "Chemspider ID", even though it no longer is one. It is just an "OPS RSC ID".
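
The two conversion requests described above can be assembled as follows. This is a minimal sketch that only builds the request URLs from the endpoint and parameter names visible in those two calls; `convert_url` is a hypothetical helper, and nothing is assumed here about the format of the response:

```python
from urllib.parse import urlencode

# Endpoint as it appears in the two calls quoted above.
OPS_RSC_API = "http://ops.rsc.org/api/v1/JSON.ashx"


def convert_url(direction: str, text: str) -> str:
    """Build a ConvertTo request URL for the given direction and input."""
    query = urlencode(
        {
            "op": "ConvertTo",
            "convertOptions.Direction": direction,
            "convertOptions.Text": text,
        }
    )
    return f"{OPS_RSC_API}?{query}"


# Step 1: SMILES -> InChI, step 2: InChI -> ChemSpider ID (CSID).
aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"
aspirin_inchi = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
step1 = convert_url("Smiles2InChi", aspirin_smiles)
step2 = convert_url("InChi2CSID", aspirin_inchi)
```

Note that `urlencode` percent-encodes more characters than the hand-written URLs above (for example `=` and `/`); both forms decode to the same parameter values.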

Now this change under the hood worries me: how can we trust that the OPS ID 403534 will be 403534 in the future and not 123877?

Assuming we want to keep using the OPS IDs, and as long as there is no InChi2OPSID convert option at ops.rsc.org, I guess the only fix is for the structure web service to be changed to use the exact structure match instead, as that returns the OPS identifier rather than the ChemSpider ID. Alternatively, is there a URI base for ChemSpider IDs we could reliably return instead?

ianwdunlop commented 9 years ago

Hello,

There is also the problem that the explorer still lists these IDs as from Chemspider, see http://explorer2.openphacts.org/compounds?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2Fdd758846-1dac-4f0d-a329-06af9a7fa413 for Aspirin. The UI reports Chemspider ID: http://ops.rsc.org/OPS40353. Obviously it's not Chemspider any more, since that is still http://www.chemspider.com/Chemical-Structure.2157.html for Aspirin. So what exactly is this OPS ID? Is it just the row ID in the database? Or is it generated somehow? Can we guarantee that OPS40353 will still be OPS40353 in the future? If someone publishes a paper using the Open PHACTS data they need that assurance.

Cheers,

Ian


Ian Dunlop myGrid Team School of Computer Science University of Manchester

leeharland commented 9 years ago

I think what we can guarantee is that OPS40353 will not be re-used. However, OPS40353 is generated from a structure+processing_parameters. So what we're trying to work out is what happens when the same structure+slightly_different_processing_parameters are passed. This may result in a different InChI, in which case a new OPS ID will be minted. But the relationship of "the old" ID to this new one needs to be clarified.

stain commented 9 years ago

OK.. what if we just change the /structure SMILES-to-URL webservice to return the truthful URI for chemspider then? That is after all what the underlying ops.rsc.org service is meant to return. E.g. http://www.chemspider.com/2157

This works fine with say /compound which can give you the "new" OPS identifier http://ops.rsc.org/OPS403534 as well.


{
 "format": "linked-data-api",
 "version": "1.4",
 "result": {
  "_about": "https://beta.openphacts.org/1.4/compound?uri=http%3A%2F%2Fwww.chemspider.com%2F2157&app_id=161aeb7d&app_key=cffc292726627ffc50ece1dccd15aeaf",
  "definition": "https://beta.openphacts.org/api-config",
  "extendedMetadataVersion": "https://beta.openphacts.org/1.4/compound?uri=http%3A%2F%2Fwww.chemspider.com%2F2157&app_id=161aeb7d&app_key=cffc292726627ffc50ece1dccd15aeaf&_metadata=all%2Cviews%2Cformats%2Cexecution%2Cbindings%2Csite",
  "primaryTopic": {
   "_about": "http://www.chemspider.com/2157",
   "exactMatch": [
    {
     "_about": "http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL25",
     "mw_freebase": 180.16,
     "inDataset": "http://www.ebi.ac.uk/chembl",
     "type": "http://rdf.ebi.ac.uk/terms/chembl#SmallMolecule"
    },
    {
     "_about": "http://www.conceptwiki.org/concept/dd758846-1dac-4f0d-a329-06af9a7fa413",
     "inDataset": "http://www.conceptwiki.org",
     "prefLabel_en": "Aspirin",
     "prefLabel": "Aspirin"
    },
    {
     "_about": "http://ops.rsc.org/OPS403534",
     "inDataset": "http://ops.rsc.org",
     "hba": 4,
     "hbd": 1,
     "inchi": "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
     "logp": 1.19,
     "molformula": "C9H8O4",
     "molweight": 180.157,
     "psa": 63.6,
     "ro5_violations": 0,
     "rtb": 3,
     "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"
    },
    {
     "_about": "http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00945",
     "inDataset": "http://linkedlifedata.com/resource/drugbank",
     "biotransformation": "Aspirin is rapidly hydrolyzed primarily in the liver to salicylic acid, which is conjugated with glycine (forming salicyluric acid) and glucuronic acid and excreted largely in the urine.",
     "description": "The prototypical analgesic used in the treatment of mild to moderate pain. It has anti-inflammatory and antipyretic properties and acts as an inhibitor of cyclooxygenase which results in the inhibition of the biosynthesis of prostaglandins. Aspirin also inhibits platelet aggregation and is used in the prevention of arterial and venous thrombosis. (From Martindale, The Extra Pharmacopoeia, 30th ed, p5)",
     "drugType": [
      "smallMolecule",
      "approved"
     ],
     "genericName": "Aspirin",
     "meltingPoint": "135 oC (boiling point 140 oC)",
     "proteinBinding": "High (99.5%) to albumin. Decreases as plasma salicylate concentration increases, with reduced plasma albumin concentration or renal dysfunction, and during pregnancy.",
     "toxicity": "Oral, mouse: LD50 = 250 mg/kg; Oral, rabbit: LD50 = 1010 mg/kg; Oral, rat: LD50 = 200 mg/kg. Effects of overdose include: tinnitus, abdominal pain, hypokalemia, hypoglycemia, pyrexia, hyperventilation, dysrhythmia, hypotension, hallucination, renal failure, confusion, seizure, coma, and death."
    }
   ],
   "isPrimaryTopicOf": "https://beta.openphacts.org/1.4/compound?uri=http%3A%2F%2Fwww.chemspider.com%2F2157&app_id=161aeb7d&app_key=cffc292726627ffc50ece1dccd15aeaf"
  }
 }
}
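
Given a /compound response shaped like the one above, a client can recover the OPS URI from the `exactMatch` entries. A minimal sketch, assuming the response has already been parsed into a Python dict and that the OPS entry is identified by `inDataset` being `http://ops.rsc.org` as in this example; `ops_uri_from_response` is a hypothetical helper name:

```python
from typing import Any, Dict, Optional

OPS_DATASET = "http://ops.rsc.org"


def ops_uri_from_response(response: Dict[str, Any]) -> Optional[str]:
    """Return the "_about" URI of the exactMatch entry from ops.rsc.org."""
    topic = response.get("result", {}).get("primaryTopic", {})
    matches = topic.get("exactMatch", [])
    # Be tolerant of a single object instead of a list (an assumption;
    # the example above only shows the list form).
    if isinstance(matches, dict):
        matches = [matches]
    for match in matches:
        if isinstance(match, dict) and match.get("inDataset") == OPS_DATASET:
            return match.get("_about")
    return None
```

For the Aspirin response shown above this would return http://ops.rsc.org/OPS403534, and None when no entry from the ops.rsc.org dataset is present.
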
leeharland commented 9 years ago

I am in the process of discussing with the RSC whether the original behaviour can be restored; otherwise this is the route we will need to take.

leeharland commented 9 years ago

Update: RSC are testing a fix as we speak.

danidi commented 9 years ago

The test instance of the search now correctly retrieves OPS1576568 (this is not yet available from the Open PHACTS API).

http://ops2.rsc.org/JSON.ashx?op=ConvertTo&convertOptions.Direction=SMILES2CSID&convertOptions.Text=CCCC%28CCC%29N1CCN2C%28=O%29N%28c3ccc%28OC%29cc3%29c4nc%28C%29cc1c24

danidi commented 9 years ago

Hi Lothar! The issue should be solved now on 1.4. Can you please retest?

lotharwissler commented 9 years ago

I can confirm that the issue is solved for all the cases I tested.