openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Chemistry search service broken #64

Open leeharland opened 10 years ago

leeharland commented 10 years ago

broken in production (1.3) right now.

[nb some discussion on https://github.com/openphacts/GLOBAL/issues/14]

leeharland commented 10 years ago

valery confirmed OCR fine, assigning to open link as next step

leeharland commented 10 years ago

OK, thanks to some great detective work, this is because URLs like https://beta.openphacts.org/1.3/structure/similarity?app_id=y&app_key=x&searchOptions.Molecule=CC(%3DO)Oc1ccccc1C(%3DO)O&searchOptions.SimilarityType=0

don't specify any thresholds or limits, so they return very large result sets. Investigating next steps, perhaps with defaults.
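As a rough illustration of the point above, a client could always send a threshold explicitly. This is only a sketch: the endpoint and the `app_id`, `app_key`, `searchOptions.Molecule` and `searchOptions.SimilarityType` names come from the URL quoted here, `searchOptions.Threshold` is the parameter name used in a later comment in this thread, and the `build_similarity_url` helper and its 0.8 default are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint taken from the URL quoted in this thread.
BASE_URL = "https://beta.openphacts.org/1.3/structure/similarity"


def build_similarity_url(smiles, app_id, app_key, threshold=0.8, similarity_type=0):
    """Hypothetical helper: build a similarity URL that always carries a threshold.

    The 0.8 default is an assumption for illustration, not a documented API default.
    """
    params = {
        "app_id": app_id,
        "app_key": app_key,
        "searchOptions.Molecule": smiles,
        "searchOptions.SimilarityType": similarity_type,
        "searchOptions.Threshold": threshold,
    }
    # urlencode percent-encodes the SMILES string and joins the pairs with "&".
    return f"{BASE_URL}?{urlencode(params)}"


url = build_similarity_url("CC(=O)Oc1ccccc1C(=O)O", app_id="y", app_key="x")
# Decode the query string again to confirm what the server would receive.
query = parse_qs(urlsplit(url).query)
print(query["searchOptions.Threshold"], query["searchOptions.Molecule"])
```

Decoding the query again with `parse_qs` is just a quick way to check that the SMILES string survives the round trip unchanged.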

leeharland commented 10 years ago

would be good to get some info from open link on where this died - presumably too much data or a timeout?

ghard commented 10 years ago

I'll check if there's something in the logs to indicate what happened in a bit.

Yrjänä

leeharland commented 10 years ago

@ghard - any update?

ChristineChichester commented 10 years ago

The too-much-data queries still seem to cause problems, but with limits it is functioning (at least in my tests). This works: https://beta.openphacts.org/1.4/structure/similarity?app_id=18983b12&app_key=c99cf43da48a1a2f9069651fe6be7c06&searchOptions.Molecule=C1%3DCC(%3DCC%3DC1C%5BC%40H%5D(C(%3DO)N%5BC%40H%5D(CCCN%3DC(N)N)C(%3DO)O)N)O&searchOptions.SimilarityType=0&searchOptions.Threshold=0.75 but changing the threshold to 0.50 doesn't.

ChristineChichester commented 10 years ago

Antonis confirms we could add threshold limits to the similarity search, for instance >99, 98, 90, 80, 70, and 50%, and return 400 when any other value is given. This should help, but it does not guarantee that some result sets will not still cause problems.

danidi commented 10 years ago

Maybe the easiest would be to give a default value of 0.8? So if people forget to specify a threshold they still get data back.

leeharland commented 10 years ago

Hoping @ghard or @antonisloizou can give us some insight into why it's failing.

@ChemConnector - what do you think for the default??

ChemConnector commented 10 years ago

A default threshold of 0.8 would be acceptable for sure. A threshold of 0.5 or less makes little sense to me and while the value could be dropped to 0.7 or lower I have always found compounds of interest at >0.8 on ChemSpider, a much larger collection than the OPS-CRS for sure.

ChristineChichester commented 10 years ago

Antonis commented on the similarity search. There are 3 options:

  1. List the allowed values in the description, return 400 when another value is given
  2. Provide a dropdown of allowed values on 3scale, return 400 when another value is given by manually constructing a request
  3. Allow any value, but only make request to the RSC API using the closest one from {99, 95, 90, 80, 70, 50} to the user input

Option 2 is the most problematic, as it needs to be hardcoded inside the Swagger generation script.
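To make options 1 and 3 concrete, here is a minimal sketch. The allowed set {99, 95, 90, 80, 70, 50} is the one listed above; the function names, the use of percentages as input, and the tie-breaking rule (prefer the higher, stricter value) are assumptions for illustration only, not the behaviour of the actual API.

```python
# Allowed similarity thresholds in percent, as listed in option 3 above.
ALLOWED_THRESHOLDS = (99, 95, 90, 80, 70, 50)


def status_for_threshold(pct):
    """Option 1 (sketch): accept only listed values, otherwise answer 400."""
    return 200 if pct in ALLOWED_THRESHOLDS else 400


def snap_to_allowed(pct):
    """Option 3 (sketch): map any input to the closest allowed value.

    min() returns the first minimum it meets, and the tuple is sorted in
    descending order, so a tie (e.g. 75, between 80 and 70) resolves to
    the higher threshold. That tie rule is an assumption.
    """
    return min(ALLOWED_THRESHOLDS, key=lambda allowed: abs(allowed - pct))


print(status_for_threshold(80), status_for_threshold(75))
print(snap_to_allowed(75), snap_to_allowed(52), snap_to_allowed(93))
```

Resolving ties upwards keeps result sets smaller, which seems in line with the concern about very large responses, but either direction would be a defensible choice.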

leeharland commented 10 years ago

Apologies for being slow today. Can someone remind me why we can't just allow any value, defaulting to 0.8 as Tony suggested when no value is supplied (and maybe forcing 0.5 if the value is <0.5)? Thanks.
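The rule proposed here could be sketched as follows. The 0.8 default and the 0.5 floor are the values mentioned in this comment; `resolve_threshold` is a hypothetical name, and where such a check would live (API layer or RSC side) is not settled in this thread.

```python
from typing import Optional

DEFAULT_THRESHOLD = 0.8  # used when the caller supplies no threshold
MIN_THRESHOLD = 0.5      # anything lower is forced up to this value


def resolve_threshold(requested: Optional[float]) -> float:
    """Hypothetical helper: apply the default and the lower bound."""
    if requested is None:
        return DEFAULT_THRESHOLD
    # Values at or above the floor pass through unchanged.
    return max(requested, MIN_THRESHOLD)


print(resolve_threshold(None), resolve_threshold(0.3), resolve_threshold(0.75))
```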

ChristineChichester commented 10 years ago

Does the RSC API allow any value? On the ChemSpider similarity search interface they only give options via a dropdown for certain values {>=99, 95, 90, 80, 70, 50}. For me at least, using 0.50 from our side didn't work but 0.75 did.

leeharland commented 10 years ago

@ChemConnector @valt could you comment?

ChemConnector commented 10 years ago

That sounds absolutely fine to me Lee. We simply block all searches with values below 0.7 and add a comment to the screen that that is the default (and minimum) value.

leeharland commented 10 years ago

@ChemConnector @valt @karapetk

hi folks - hopefully you saw #176 and could we get an update on both of these? thanks

lucaxbartek commented 10 years ago

In many cases, a Euclidean search that works on ChemSpider fails on the API. Examples include:

C(=C/Cl)\Cl with 0.9 threshold
CCCC(C)C1(C(=O)NC(=NC1=O)[O-])CC.[Na+] with 0.8 threshold

The list goes on. My suspicion is that this is because CS has a default search hits limit of 100; when you change that to 1000, all calls fail. Would it be possible to limit the search results by default, as previously suggested, or (as also previously mentioned) to raise the timeout limit? Or is any other solution available?

valt commented 10 years ago

Here is what we see running both queries in batch mode:

Running similarity (Euclidian 0.9) on 2 SMILES in 8 threads using http://ops.rsc.org/api/v1/JSON.ashx
0: SMILES:CCCC(C)C1(C(=O)NC(=NC1=O)[O-])CC.[Na+]; Count: 8; Duration: 20.627287
1: SMILES:C(=C/Cl)\Cl; Count: 2438; Duration: 66.3834998
Total,2
Errors,0
Success,2
Total Time,66.4174926 sec.

Running similarity (Euclidian 0.8) on 2 SMILES in 8 threads using http://ops.rsc.org/api/v1/JSON.ashx
0: SMILES:CCCC(C)C1(C(=O)NC(=NC1=O)[O-])CC.[Na+]; Count: 276; Duration: 30.7441478
1: SMILES:C(=C/Cl)\Cl; Count: 30094; Duration: 320.8561992
Total,2
Errors,0
Success,2
Total Time,320.9128618 sec.

  1. We do not set any timeout on our side - please check 3scale.
  2. Please don't test it on CHEMSPIDER - test here: http://ops.rsc.org/
  3. If the above performance is unacceptable, we'll need a more involved solution (scaling out, for example)
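One way to read the batch numbers above against point 1: a query can succeed on the RSC side and still fail for the caller if a gateway in front of it gives up first. The sketch below uses the counts and durations from the batch output; the 60-second timeout is purely an assumed figure for illustration, since the actual 3scale setting is not stated in this thread.

```python
# (SMILES, threshold, hit count, duration in seconds), copied from the batch output above.
RUNS = [
    ("CCCC(C)C1(C(=O)NC(=NC1=O)[O-])CC.[Na+]", 0.9, 8, 20.627287),
    (r"C(=C/Cl)\Cl", 0.9, 2438, 66.3834998),
    ("CCCC(C)C1(C(=O)NC(=NC1=O)[O-])CC.[Na+]", 0.8, 276, 30.7441478),
    (r"C(=C/Cl)\Cl", 0.8, 30094, 320.8561992),
]

# Hypothetical gateway timeout; the real value is unknown here.
ASSUMED_TIMEOUT_S = 60.0


def runs_over_timeout(runs, timeout_s):
    """Return the runs whose backend duration exceeds the given timeout."""
    return [run for run in runs if run[3] > timeout_s]


for smiles, threshold, count, duration in runs_over_timeout(RUNS, ASSUMED_TIMEOUT_S):
    print(f"{smiles} at {threshold}: {count} hits took {duration:.1f}s, "
          f"over the assumed {ASSUMED_TIMEOUT_S:.0f}s limit")
```

Under that assumed limit, both runs for the small dichloroethene query would be cut off even though the backend reports no errors, which would be consistent with the failures reported earlier.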