openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Chemical Structure Search: Substructure searching fails although exact search is working #68

Open leeharland opened 10 years ago

leeharland commented 10 years ago

Issue: For some SMILES, exact search works but substructure search fails.

Origin of SMILES: The SMILES originate from ChEMBL and are for compounds that have results for Factor X (Homo sapiens).

Example SMILES: Clc1ccc2cc(sc2n1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3

What happens when a search with the example SMILES string is performed on ops.rsc.org directly (instead of via the Open PHACTS API)?

Exact search:

1) http://ops.rsc.org/JSON.ashx?op=ExactStructureSearch&searchOptions.Molecule=Clc1ccc2cc(sc2n1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
2) http://ops.rsc.org/JSON.ashx?op=GetSearchStatus&rid=5157fe65-c4b2-418e-a304-3830132825a8 => {"Count":1,"Elapsed":"PT15.357S","Message":"Finished","Progress":1,"Status":6}
3) http://ops.rsc.org/JSON.ashx?op=GetSearchResult&rid=5157fe65-c4b2-418e-a304-3830132825a8 => [1536786]

Substructure search:

1) http://ops.rsc.org/JSON.ashx?op=SubstructureSearch&searchOptions.Molecule=Clc1ccc2cc(sc2n1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3&resultOptions.Limit=10
2) http://ops.rsc.org/JSON.ashx?op=GetSearchStatus&rid=3ad50d8f-dd11-48a2-bc8f-ddc0b6b6c59e => {"Count":0,"Elapsed":"PT15.31S","Message":"Finished","Progress":1,"Status":6}
3) http://ops.rsc.org/JSON.ashx?op=GetSearchResult&rid=3ad50d8f-dd11-48a2-bc8f-ddc0b6b6c59e => []

The above result indicates that the substructure search for the example SMILES doesn’t return a result on ops.rsc.org but the exact search returns a result.
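For anyone who wants to repeat the comparison for more than one SMILES, the three-step flow used in the URLs above (submit the search, poll `GetSearchStatus`, fetch `GetSearchResult`) could be scripted roughly as follows. This is only a sketch: the operation and parameter names are copied from the URLs above, but the helper names (`build_url`, `run_search`, `http_get_json`) are made up for illustration, and it assumes that the submit call returns the request id (`rid`) as a JSON string and that `"Message":"Finished"` marks completion. The `fetch` argument is injectable so the logic can be exercised without the live service.

```python
import json
import time
import urllib.parse
import urllib.request

# Endpoint taken from the URLs quoted in this issue.
BASE_URL = "http://ops.rsc.org/JSON.ashx"


def build_url(op, params):
    """Return the endpoint URL for `op` with URL-encoded query parameters."""
    # urlencode percent-encodes characters such as ( ) = [ ] in the SMILES.
    query = urllib.parse.urlencode({"op": op, **params})
    return f"{BASE_URL}?{query}"


def http_get_json(url):
    """Fetch `url` over the network and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def run_search(op, smiles, extra=None, fetch=http_get_json,
               poll_seconds=2.0, max_polls=30):
    """Submit a search, poll until it reports 'Finished', return the hits."""
    params = {"searchOptions.Molecule": smiles, **(extra or {})}
    # Assumption: the submit call returns the request id as a JSON string.
    rid = fetch(build_url(op, params))
    for _ in range(max_polls):
        status = fetch(build_url("GetSearchStatus", {"rid": rid}))
        # Assumption: "Finished" in the Message field marks completion.
        if status.get("Message") == "Finished":
            return fetch(build_url("GetSearchResult", {"rid": rid}))
        time.sleep(poll_seconds)
    raise TimeoutError(f"search {rid} did not finish after {max_polls} polls")
```

A comparison for one molecule would then be `run_search("ExactStructureSearch", smiles)` against `run_search("SubstructureSearch", smiles, extra={"resultOptions.Limit": 10})`; a SMILES is affected by this issue when the first call returns hits and the second returns an empty list.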
The complete list of 41 SMILES for which a substructure search fails but an exact search is successful when performed with the Open PHACTS API can be found below:

Clc1ccc2cc(sc2n1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
Clc1cnc2cc(sc2c1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
NC(=N)c1ccc(CNC(=O)[C@@H]2Cc3ccc(NC(=O)CCN4CCN(CC4)CCC(=O)Nc5cccc(CC@@HCc6ccccc6)C(=O)N2)c5)cc3)cc1.OC(=O)C(F)(F)F
O=C1CN(CCN1Cc2cc3c[nH]ccc3n2)S(=O)(=O)c4cc5ccccc5s4
Clc1ccc2cc(sc2c1)S(=O)(=O)N3CCN(Cc4cc5[nH]cccc5n4)C(=O)C3
[Na+].COc1cc2oc(C)c(CCC(=O)[O-])c2cc1OS(=O)(=O)O
Clc1ccc2cc(sc2c1)S(=O)(=O)N3CCN(Cc4nc5cc[nH]cc5n4)C(=O)C3
CC(C)(C(=O)O)c1cc(c(O)c(c1)c2cc3c(N)[nH]ccc3n2)c4cccc(CNC(=O)Nc5c(F)cccc5F)c4
O[C@@H]1COC(=O)c2cc(O)c(O)c(O)c2c3c(O)c(O)c(O)cc3C(=O)O[C@H]1[C@@H]4OC(=O)c5cc(O)c(O)c(O)c5c6c(O)c(O)c(O)c7C@H[C@@H]4OC(=O)c67
NC(=N)c1ccc(CNC(=O)[C@@H]2Cc3cccc(NC(=O)CN4CCN(CC4)CC(=O)Nc5cccc(CC@@HC(=O)N2)c5)c3)cc1.OC(=O)C(F)(F)F
Clc1ccc(cc1)c2ccc(cc2)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
Clc1ccc2cc(sc2c1)S(=O)(=O)N3CCN(Cc4cc5cc[nH]cc5n4)C(=O)C3
CC(C)(C(=O)O)c1cc(c(O)c(c1)c2cc3c(N)[nH]ccc3n2)c4cccc(CNC(=O)C@@HCc5ccccc5)c4
Clc1ccc(\C=C\S(=O)(=O)N2CCN(Cc3cc4c[nH]ccc4n3)C(=O)C2)cc1
CC(C)(C(=O)O)c1cc(c(O)c(c1)c2cc3c(N)[nH]ccc3n2)c4cccc(CNC(=O)Nc5ccc(cc5)C(=O)O)c4
O[C@@H]1COC(=O)c2cc(O)c(O)c(O)c2c3c(O)c(O)c(O)cc3C(=O)O[C@H]1[C@@H]4OC(=O)c5cc(O)c(O)c(O)c5c6c(O)c(O)c(O)c7C@H[C@@H]4OC(=O)c67
O=C1CN(CCN1Cc2cc3c[nH]ccc3n2)S(=O)(=O)c4cc5cccnc5s4
NC(=N)c1ccc(CNC(=O)[C@@H]2Cc3ccc(NC(=O)CN4CCCN(CC4)CC(=O)Nc5ccc(CC@@HC(=O)N2)cc5)cc3)cc1.OC(=O)C(F)(F)F
OC@@HC(=O)NCc2cccc(c2)c3cccc(c3O)c4cc5c[nH]ccc5n4
NC(=N)c1ccc(CNC(=O)[C@@H]2Cc3ccc(NC(=O)CN4CCN(CC4)CC(=O)Nc5cccc(CC@@HC(=O)N2)c5)cc3)cc1.OC(=O)C(F)(F)F
Clc1ccc(\C=C\S(=O)(=O)N2CCN(Cc3cc4c[nH]ccc4n3)C(=O)C2)cc1
CC(C)(C(=O)O)c1cc(c(O)c(c1)c2cc3c(N)[nH]ccc3n2)c4cccc(CNC(=O)Nc5ccc(cc5)C(=O)O)c4
O[C@@H]1COC(=O)c2cc(O)c(O)c(O)c2c3c(O)c(O)c(O)cc3C(=O)O[C@H]1[C@@H]4OC(=O)c5cc(O)c(O)c(O)c5c6c(O)c(O)c(O)c7C@H[C@@H]4OC(=O)c67
O=C1C(CCN1Cc2cc3cc[nH]cc3n2)NS(=O)(=O)c4cc5ncccc5s4
O=C1CN(CCN1Cc2cc3c[nH]ccc3n2)S(=O)(=O)c4cc5ncccc5s4
Nc1[nH]ccc2nc(cc12)c3cccc(c3O)c4cccc(CNC(=O)C@@HCc5ccccc5)c4
Nc1[nH]ccc2nc(cc12)c3cccc(c3O)c4cccc(CNC(=O)C@@HCc5ccccc5)c4
Nc1[nH]ccc2nc(cc12)c3cc(cc(c3O)c4cccc(CNC(=O)C@@HCc5ccccc5)c4)C(=O)O
Nc1[nH]ccc2nc(CN3CCN(CC3=O)S(=O)(=O)c4cc5ccc(Cl)cc5s4)cc12
Clc1ccc(s1)c2ccc(s2)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
NC(=O)NCc1cccc(c1)c2cccc(c2O)c3cc4c[nH]ccc4n3
Clc1ccc(s1)c2ccc(s2)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
NC(=N)c1cccc(CC@HC(=O)N4CCC(CC4)N5CCCCC5)c1
COc1cc(\C=C\S(=O)(=O)N2CCN(Cc3cc4c[nH]ccc4n3)C(=O)C2)sc1Cl
Clc1ccc(\C=C\S(=O)(=O)N2CCN(Cc3cc4c[nH]ccc4n3)C(=O)C2)s1
Clc1ccc(CCS(=O)(=O)N2CCN(Cc3cc4c[nH]ccc4n3)C(=O)C2)s1
Brc1ccc2cc(sc2c1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
Oc1[nH]ccc2nc(CN3CCN(CC3=O)S(=O)(=O)c4cc5ccc(Cl)cc5s4)cc12
Clc1ccc(\C=C\S(=O)(=O)N2CCN(Cc3cc4c[nH]ccc4n3)C(=O)C2)s1
Clc1c(sc2ccccc12)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3
Clc1ccc2cc(sc2c1)S(=O)(=O)N3CCN(Cc4cc5c[nH]ccc5n4)C(=O)C3

leeharland commented 10 years ago

Ken Karapetyan added a comment - 20/Mar/14 7:36 PM

We are able to reproduce this bug. It seems this could be related to cases where our cheminformatics toolkit, Indigo, is not able to kekulize molecules that have an explicit hydrogen bonded to an aromatic atom. The issue has been reported to our chemical search cartridge vendor (GGA):

https://groups.google.com/forum/#!topic/indigo-bugs/4yETsKjVBvM
https://groups.google.com/forum/#!topic/indigo-bugs/7e6QBimcq3g
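To triage which SMILES might fall under this hypothesis, one rough option is a text-level check for bracket atoms that are aromatic (lowercase symbol) and carry an explicit hydrogen, such as `[nH]`. The sketch below is a string heuristic only: it does not parse SMILES, it is not how Indigo works, and the name `has_explicit_aromatic_h` is invented for illustration.

```python
import re

# Bracket atom with a lowercase (aromatic) element symbol and an explicit
# hydrogen count, e.g. [nH], [nH+], [cH-]. Optional isotope, chirality,
# charge, and atom-class fields are tolerated.
AROMATIC_H = re.compile(
    r"\[\d*(?:se|as|[bcnops])@{0,2}H\d*(?:[+-]+\d*)?(?::\d+)?\]"
)


def has_explicit_aromatic_h(smiles):
    """Return True if the SMILES text contains an aromatic bracket atom with H."""
    return AROMATIC_H.search(smiles) is not None
```

The example SMILES from this issue contains `[nH]` and is flagged by this check. Some entries in the list, for example the `[Na+]` salt, contain no such atom, so this pattern on its own may not account for every failing case.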

StefanSenger commented 10 years ago

Just adding this comment so that it's easier for me ( @StefanSenger ) to 'watch' this issue

karapetk commented 10 years ago

Unfortunately, Indigo is not responding to my tickets.

@valt