Open tbrx opened 2 years ago
@tbrx You could cluster the results and select a representative from each cluster?
Hey all, just to give a bit more background on the structure-based component of this:

- Pharmacophores: there are two. The first is hand-constructed in MOE ("Molecular Operating Environment"), a commercial software package for SBDD. The second uses the RDKit pharm3d functionality to embed molecules onto a cruder pharmacophore (based on the 3D coordinates and feature types of the MOE pharmacophore) and scores them by RMSD.
- Docking: two main docking tools are used: Vina (open source and standard) and MOE (proprietary). As @tbrx remarked, we don't put too much confidence in the docking scores. In fact, Vina and MOE often do not agree well; it seems MOE can deal with the allosteric site a bit better, and it can account for the solvent more explicitly.

Some choices needed to be made for the pharmacophore. The following features were included:

Essential:
- H-pi interaction between lysine CH and aromatic ring
- hydrogen bond donation from lysine backbone N to a heteroatom
- molecule must fit within the pocket volume

Optional:
- ionic interaction between a cation in the ligand and glutamate COO- and a nearby aspartate
- hydrogen bond donation from ligand to a water
- interaction with Cl- anion

Of course, Cl- and that specific HOH are not present in all of the PDBs, so opinions may differ on how sensible it is to target those. However, the HOH amide interaction occurs in more than one fragment.

Here is a screenshot of the pharmacophore straight from the MOE GUI. Main features: cyan = HBA, purple = HBD, orange = aromatic.

Any other contributors who see flaws with this, or who have a wildly different pharmacophore hypothesis, are welcome to share comments.
The idea is that, because there are not really enough activity data available for a ligand-based ML-type prediction, these two approaches serve as a filter and guide, informed by structure and human choice, to keep the output of the generator in check or steer it the right way.
Hi all, we managed to shrink our large collection of compounds down to a smaller set of 30 or so compounds. In general, only high-ranked molecules that were successful in both the docking screen and the pharmacophore screen were retained. Then duplicates, similar compounds and synthetically infeasible compounds were filtered out by eye (my eye). The CSV pasted below also contains the "origin" of each structure, meaning:

- VS: search on extremely large databases, so near commercial
- ph4g: generator guided by pharmacophore RMSD from RDKit
- sim: generator guided by an objective including similarity to two of the known crystallized fragments
- GA: genetic algorithm guided by docking score and some descriptor cutoffs
- SBDD benchmark: a single hand-designed compound confirmed by docking, as a reference
smiles,origin,comment
O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1,VS,
O=C(Nc1n[nH]c(CC)c1)c1c(C2C[NH2+]CC2)[nH]nc1,VS,
O=C(Nc1c(C(=O)Nc2cc(C(=O)N)[nH]c2)nccn1)C(C)(C)C,VS,
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1,VS,
O=C(Nc1c(O)cccc1)C(O)C(O)C(O)C(O)CO,VS, gluconic acid conjugate which might be a problem
O=C(NC=1C(=O)NC(C)=CC=1)C(=O)N1C(CO)CC(O)C1,VS,
Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1,VS,
O=C(Nc1cc(NC(=O)C)cc(C(=O)Nc2c(O)cc(C)cc2)c1)C,VS,
O=C(Cc1ccccc1)NCc1cccc2c1CN(C1CCC(=O)NC1=O)C2=O,ph4g,lenalidomide scaffold
COc1cc2c(cc1Nc1ncc(Cl)c(NC3CCCCC3NS(C)(=O)=O)n1)CCNC(=O)C2,ph4g,
O=C1CCC(N2Cc3c(CNC(=O)c4cccc(Cl)c4)cccc3C2=O)C(=O)N1,ph4g,lenalidomide scaffold
COC(=O)c1cc(Cl)ccc1NC(=O)COCC(=O)N1CCC(N2C(=O)CCC2=O)CC1,ph4g,
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O,ph4g, comparable with gluconic amide from VS
S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CC[NH+](C)CC2)cc1,sim,
Fc1cc(C(O)C(=O)Nc2cc(C#N)ccc2)c(NC(=O)N2CC[NH+](C)CC2)cc1,sim,
O=C(Nc1cc(C#N)ccc1)C([NH2+]Cc1cc(N2CC[NH+](C)CC2)ccc1)C,sim, double cation - investigate protonation state at 7.4
S(CC(NC(=O)Nc1ccc(NC(=O)N2CC[NH+](C)CC2)cc1)(CO)C)C,sim,strange binding mode
S(CCC(NC(=O)N1CC[NH+](C)CC1)C(=O)Nc1cc(C#N)ccc1)C,sim,
CC(CO)(CO)NC(=O)Nc1ccccc1,GA,too close to known frag?
Cc1nc2c(C#N)ccc(O)c2[nH]1,GA,
N#Cc1ccc(O)c(Nc2nccs2)c1,GA,
CNCC1=C(C(=O)Nc2ccccc2)CCC1,GA,
CNCc1ccc(-c2cccc3[nH]c(C)nc23)nn1,GA,
Cc1ccc(C)c2oc(-c3ccc(O)nn3)cc12,GA,
O=C(Nc1ccc(O)s1)c1ccccc1,GA,
CCCCC(=O)Nc1cc(CC)ccc1O,GA,
Cc1c[nH]c(=O)n1CCC(=O)c1ccccc1,GA,
O=C(Nc1cccc(NCc2ccccc2)c1)c1ccccc1,GA,
Nc1cccc(CC(=O)Nc2ccccc2)c1,GA,
O=C(NCc1ccccc1)C1CCNC1,GA,
Cc1cccc(-c2ncc(CN)o2)c1,GA,
O=C(Nc1cccc(-c2ccccc2)c1)C1CCNCC1,GA,
[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O,SBDD benchmark,
for the more visually inclined, here is the mol grid image:
We also have a bit more data available per compound, so if we want to prioritize any compounds we have docking conformations etc. to hand, and can also calculate some descriptors and filter out anything too lipophilic, et cetera. We'd be glad to hear any comments and criticisms of our set thus far!
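For anyone who wants to slice the CSV above by generation method, here is a minimal sketch using only the Python standard library. The `CSV_TEXT` excerpt holds just six of the rows posted above, and the function name `group_by_origin` is made up for illustration; paste in the full table to work with the whole set.

```python
import csv
import io
from collections import defaultdict

# Excerpt of the CSV posted above (six rows, at least one per origin label).
CSV_TEXT = """smiles,origin,comment
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1,VS,
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O,ph4g, comparable with gluconic amide from VS
S(CCC(NC(=O)N1CC[NH+](C)CC1)C(=O)Nc1cc(C#N)ccc1)C,sim,
Cc1cccc(-c2ncc(CN)o2)c1,GA,
O=C(NCc1ccccc1)C1CCNC1,GA,
[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O,SBDD benchmark,
"""


def group_by_origin(csv_text):
    """Return a dict mapping each origin label to its list of SMILES."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row["origin"].strip()].append(row["smiles"].strip())
    return dict(groups)


by_origin = group_by_origin(CSV_TEXT)
for origin, smiles in by_origin.items():
    print(f"{origin}: {len(smiles)} compound(s)")
```

Grouping this way makes it easy to check that a shortlist still contains something from every generation method.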
Hi! Compounds are most likely to accumulate in E. coli if they contain a sterically unencumbered amine (such as a primary amine), are relatively rigid, and have low globularity. I filtered your top 30 hits to have <5 rotatable bonds, <0.1 globularity and a primary amine, which yielded 6 compounds:
O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1
O=C(Nc1n[nH]c(CC)c1)c1c(C2C[NH2+]CC2)[nH]nc1
O=C(Nc1c(C(=O)Nc2cc(C(=O)N)[nH]c2)nccn1)C(C)(C)C
Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1
Nc1cccc(CC(=O)Nc2ccccc2)c1 (available from Enamine BBV-34448717)
Cc1cccc(-c2ncc(CN)o2)c1 (available from Enamine BBV-92898715)
Your top hits with additional properties are here
**WIP:** I did check the synthetic routes for these 6 compounds in Postera, but it seems to remove the primary amine when searching, as pointed out by @drc007 in the comment below. I'm looking into why.
(I'm by no means an expert; these are the filtering criteria that I and others have used in various rounds of this challenge.)
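The filter described above can be written down as a small predicate. This is only a sketch: it assumes the three descriptors have already been computed elsewhere (globularity in particular needs 3D conformers), the `Descriptors` class and `passes_accumulation_filter` function are hypothetical names, and the three records are placeholders rather than compounds from this thread.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    name: str
    rotatable_bonds: int
    globularity: float
    has_primary_amine: bool


def passes_accumulation_filter(d, max_rotatable=5, max_globularity=0.1):
    """Apply the thresholds quoted above: primary amine, <5 rotatable bonds, <0.1 globularity."""
    return (
        d.has_primary_amine
        and d.rotatable_bonds < max_rotatable
        and d.globularity < max_globularity
    )


# Placeholder records with made-up descriptor values, purely for illustration.
examples = [
    Descriptors("rigid_flat_primary_amine", 2, 0.03, True),
    Descriptors("flexible_primary_amine", 8, 0.05, True),
    Descriptors("rigid_secondary_amine", 2, 0.03, False),
]

kept = [d.name for d in examples if passes_accumulation_filter(d)]
print(kept)
```

Keeping the thresholds as keyword arguments makes it simple to loosen one criterion and see how many compounds come back.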
@finlayiainmaclean None of the examples above appear to contain a primary amine?
@drc007 You're correct that searching the molecules (e.g. O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1) through Postera seems to remove the primary amines (resulting in O=C(Nc1nc2ccccc2[nH]1)c1cn[nH]c1C1CCNC1). I've amended the above post and will see if I can generate synthetic routes without this change in structure.
Hi all, thank you for the synthetic routes and remarks on "rules of thumb" around accumulation in E. coli.
I read up using these references [0][1]. Primary amines certainly seem like a good strategy to increase accumulation, though as can be seen in [0], a working antibiotic structure is often necessary before modification to increase accumulation is considered. Some more interesting results are in [1], where they suggest a positive charge might be the most important factor. They also mention globularity and flexibility, so I assume this is the paper @finlayiainmaclean is referencing. Fortunately, most of the compounds upthread will have a charged secondary amine at pH 7.4, so the most important factor is there. However, they also note: "Even conversion of the primary amine to an amine with more substitutions had a deleterious effect on accumulation." In Fig. 2 of the paper they show a few series of compounds and compare primary, secondary, tertiary and quaternary amines directly, which is pretty interesting and shows a marked but not complete decrease.
They use a simple random forest model based on descriptors to model E. coli accumulation. It might be interesting to reproduce this so people can use it here to get more insight into their compounds or to prioritize candidates. The model is available at https://github.com/HergenrotherLab/GramNegAccum, but it relies on proprietary software to generate conformations, which are necessary for some of the descriptors.
[0] 10.1021/acsinfecdis.0c00715 [1] 10.1038/nature22308
Also, what happened here is just deprotonation of the (secondary) amine. Interconversion between these forms is easy (and the major form will be protonated at pH 7.4 anyway), so it is sensible to look for routes that include the free-base amine. But as @drc007 remarks, most of the molecules are secondary amines. The only primary amine in our set is [NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O. I think the SMARTS query you are using for primary amines is probably faulty because it does not account for charge. In case you're doing this in RDKit, here are SMARTS patterns for primary amines, charged and uncharged:
from rdkit import Chem

smis = ["[NH4+]","[NH3+]C","[NH2+](C)C","[NH+](C)(C)C","[N+](C)(C)(C)C","N","NC","N(C)C","N(C)(C)C"]
patt = Chem.MolFromSmarts("[NX3;!H3;H2,!H1;!H0;!$(NC=O)]")  # uncharged primary amine
patt2 = Chem.MolFromSmarts("[NX4+;!H4;H3,!H2;!H1;!H0;!$(NC=O)]")  # charged primary amine
for smi in smis:
    mol = Chem.MolFromSmiles(smi)
    # True if the molecule matches either the charged or the uncharged pattern
    print(mol.HasSubstructMatch(patt2) or mol.HasSubstructMatch(patt))
This should print False, True, False, False, False, False, True, False, False (one value per line), because only [NH3+]C and NC are primary amines.
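On the earlier point that the protonated form dominates at pH 7.4: a quick Henderson-Hasselbalch estimate illustrates why. This is a rough sketch only. The pKa of 11.3 is an assumed literature value for unsubstituted pyrrolidine, not a measured or predicted value for the compounds in this thread, whose substituted pyrrolidines will differ somewhat.

```python
def fraction_protonated(pka, ph=7.4):
    """Fraction of a monoprotic base present as its conjugate acid (BH+) at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumption: pKa of the conjugate acid of unsubstituted pyrrolidine.
PYRROLIDINE_PKA = 11.3

print(f"Fraction protonated at pH 7.4: {fraction_protonated(PYRROLIDINE_PKA):.4f}")
```

Even if the real pKa were a couple of units lower, the charged form would still be the major species at physiological pH, which is why searching with the free base and scoring with the cation are both reasonable.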
Very interesting @dehaenw @drc007 @finlayiainmaclean. I'd say that accumulation in E. coli via implementation of these EntryWay criteria is a "nice to have". At this stage we're interested in binders, and can engineer in some accumulation biases later. Not a problem to include now, but not necessary.
@dehaenw shall we just take a look at the top 30 you've shown above? It'd be very useful to know which can just be bought, if anyone has a quick way of sorting the molecules in that way? Otherwise @edwintse will try manually, and @danielgedder can help with considering simplest synthetic routes?
From the description of the methodology above, based on the fragment-defined pharmacophore, I'm interested to see what happens here, experimentally.
I would definitely just consider these 30 compounds for now. A few of these, but not all, are part of make on demand libraries. Searching on chem-space.com using the unprotonated amines:
O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2CNCC2)[nH]nc1
O=C(Nc1n[nH]c(CC)c1)c1c(C2CNCC2)[nH]nc1
O=C(Nc1c(C(=O)Nc2cc(C(=O)N)[nH]c2)nccn1)C(C)(C)C
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1
O=C(Nc1c(O)cccc1)C(O)C(O)C(O)C(O)CO
O=C(NC=1C(=O)NC(C)=CC=1)C(=O)N1C(CO)CC(O)C1
Clc1cc(NC(=O)c2c(C3CNCC3)[nH]nc2)c(O)cc1
O=C(Nc1cc(NC(=O)C)cc(C(=O)Nc2c(O)cc(C)cc2)c1)C
O=C(Cc1ccccc1)NCc1cccc2c1CN(C1CCC(=O)NC1=O)C2=O
COc1cc2c(cc1Nc1ncc(Cl)c(NC3CCCCC3NS(C)(=O)=O)n1)CCNC(=O)C2
O=C1CCC(N2Cc3c(CNC(=O)c4cccc(Cl)c4)cccc3C2=O)C(=O)N1
COC(=O)c1cc(Cl)ccc1NC(=O)COCC(=O)N1CCC(N2C(=O)CCC2=O)CC1
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O
S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CCN(C)CC2)cc1
Fc1cc(C(O)C(=O)Nc2cc(C#N)ccc2)c(NC(=O)N2CCN(C)CC2)cc1
O=C(Nc1cc(C#N)ccc1)C(NCc1cc(N2CCN(C)CC2)ccc1)C
S(CC(NC(=O)Nc1ccc(NC(=O)N2CCN(C)CC2)cc1)(CO)C)C
S(CCC(NC(=O)N1CCN(C)CC1)C(=O)Nc1cc(C#N)ccc1)C
CC(CO)(CO)NC(=O)Nc1ccccc1
Cc1nc2c(C#N)ccc(O)c2[nH]1
N#Cc1ccc(O)c(Nc2nccs2)c1
CNCC1=C(C(=O)Nc2ccccc2)CCC1
CNCc1ccc(-c2cccc3[nH]c(C)nc23)nn1
Cc1ccc(C)c2oc(-c3ccc(O)nn3)cc12
O=C(Nc1ccc(O)s1)c1ccccc1
CCCCC(=O)Nc1cc(CC)ccc1O
Cc1c[nH]c(=O)n1CCC(=O)c1ccccc1
O=C(Nc1cccc(NCc2ccccc2)c1)c1ccccc1
Nc1cccc(CC(=O)Nc2ccccc2)c1
O=C(NCc1ccccc1)C1CCNC1
Cc1cccc(-c2ncc(CN)o2)c1
O=C(Nc1cccc(-c2ccccc2)c1)C1CCNCC1
NC[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O
yields the attached output (cleaned up by me to remove duplicates), showing that about half of the compounds are in ChemSpace's catalog (which includes the catalogs, or subsets of them, of Enamine, UORSY etc.).
chemspace-search-20220621110236.csv
Searching ZINC, I get the following output (the "source line" field corresponds to the compound number; it includes duplicates):
This is from ZINC, so I have not checked which of the compounds are actually in stock.
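The unprotonated SMILES used for the ChemSpace search above can be derived from the charged ones in the original CSV. Below is a minimal sketch that does this with a plain regular expression. It is a string-level shortcut that assumes the only charged atoms are bracketed [NH+], [NH2+] or [NH3+] groups, as in this set; the function name `neutralize_ammonium` is made up here. For anything more general, a cheminformatics toolkit (for example RDKit's `rdMolStandardize.Uncharger`) would be the safer route.

```python
import re

# Matches bracketed, protonated aliphatic nitrogens: [NH+], [NH2+], [NH3+].
# Aromatic [nH] and quaternary [N+] are deliberately left untouched.
_PROTONATED_N = re.compile(r"\[NH[23]?\+\]")


def neutralize_ammonium(smiles):
    """Replace each protonated amine atom with a neutral N (implicit hydrogens)."""
    return _PROTONATED_N.sub("N", smiles)


charged = [
    "O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1",
    "S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CC[NH+](C)CC2)cc1",
    "[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O",
]

neutral = [neutralize_ammonium(s) for s in charged]
for smi in neutral:
    print(smi)
```

The three outputs correspond to entries in the neutral list posted above, which is a handy sanity check before submitting a batch search.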
Here is a sdf file containing all structures, availability from vendors, PubChem/ChEMBL ID where relevant and a quick patent search. OSAmolecules.sdf.zip
That's great work @dehaenw and @drc007. @edwintse @danielgedder - what do you think? Any easy purchases and easy syntheses? We've about a week to order compounds in... Are there some quick wins here, particularly if we can order in compounds that are representative of any clusters that are structurally similar? @dehaenw I'm assuming there are no other criteria for ranking these?
@mattodd Here's a summary after a quick scifinder search for the 30 compounds
Thank you @edwintse for the overview. It's good to see there are a few commercially available.
@mattodd, I think for prioritizing there are some logical compounds to pick, because they would give the most information from the pharmacophore POV: pyrrolidinopyrazoles occur 5 times, so it would probably be worthwhile to check one from this group.
Regarding this class: of the two orange compounds above (from @edwintse's image) that belong to this class, the upper one's substitution is probably better, because the basic amine of the pyrrolidine is important for predicted binding (it makes an H-bond to a backbone carbonyl oxygen of Gly309). Pyrrolidine to piperidine is probably an OK substitution. Unfortunately, the addition of that methyl group on the benzimidazole could kill activity for steric reasons; as you can see in the picture below, there is not really room for it (PDB fragment in purple, originally proposed molecule without methyl on benzimidazole in green):
Out of the molecules of this type, probably the most useful one to aim for is Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1, or even the 3-cyanophenyl derivative for ideal comparison with the known fragments. On Enamine I can find Z2606031307 as the closest analog and Z2377588160 as another interesting one. This is probably not relevant currently, but it seems a lot of derivatives of these are available in the REAL catalog.
The third orange compound: removing dimethyl propionyl should be an ok substitution.
The fourth: removing methyls is completely fine
The commercial substances in green look good. Those are pretty diverse sensible picks.
Then regarding the remaining black compounds - which ones would give the most information, which ones are the more likely hitters. Approximately in this order my priority would be:
[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O
because it is fairly close to the fragments and has a nice docking pose
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1
to explore whether this heterocycle is a good replacement for the benzene ring in the fragments (substituting the more accessible bis-hydroxyethyl substance and changing the ring substituent to ethyl, cyano or a halogen should all be fine)
N#Cc1ccc(O)c(Nc2nccs2)c1
to see if thiazole N can undergo the H-bond from lys backbone N
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O
to see if having a sugar like group in this region of the pocket is sensible at all
O=C(NC=1C(=O)NC(C)=CC=1)C(=O)N1C(CO)CC(O)C1
would be interesting if this oxalamide was active
One more thing: I just noticed that the entry CC(CO)(CO)NC(=O)Nc1ccccc1,GA,too close to known frag? (the second green compound) is actually not close, but identical to frag 374. Sorry! (Though it is good to see it still shows up within the hits.)
OK @edwintse want to update the orange/greens based on @dehaenw suggestions? Looks like we can quickly order some here. Then for the 5 suggested "makes", are they ca one-steppers?
From my perspective it would be nice if we could try to include some from each of the "different" generation methods (i.e. "VS","ph4g","sim", and "GA" in @dehaenw 's post up above). I sort of lost track of which molecules are which in the Orange / Green / Blue classification — is it the case that all of them are represented as orderable, or easy makes?
If the easily obtainable ones are all from "VS" (rather than from the other three) would be slightly disappointing…
I agree, it would be best to have one (or more) of each category, from our POV, because this would allow us at least some insight into which of the strategies is most promising. Making use of the info above, I would give the following "top 2" for each category:
VS:
O=C(Nc1cc(NC(=O)C)cc(C(=O)Nc2c(O)cc(C)cc2)c1)C
commercially available
Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1
"privileged scaffold"
ph4g:
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O
O=C1CCC(N2Cc3c(CNC(=O)c4cccc(Cl)c4)cccc3C2=O)C(=O)N1
patent structure
sim:
S(CCC(NC(=O)N1CC[NH+](C)CC1)C(=O)Nc1cc(C#N)ccc1)C
S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CC[NH+](C)CC2)cc1
GA:
Cc1cccc(-c2ncc(CN)o2)c1
commercially available
O=C(NCc1ccccc1)C1CCNC1
commercially available
SBDD Benchmark:
[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O
Hi, a few of us at UCL CS (and further afield) have a WIP submission. I realize we're getting down to the wire, and we're having a bit of a challenge finishing this off, so I wanted to share our progress and possibly reach out for assistance. We've been generating candidate molecules and are now struggling to reduce the generated set to a good small set of distinct "entries".
We've gone through a few iterations of trying to decide what a good "reward" should be, to direct the generative model. One option is docking scores.
@dehaenw has defined a pharmacophore model based on the four initial fragments, which we are also using as a screen. He can provide more details on this. At the moment, this is our primary way of narrowing down the candidates. (There were too few fragments for us to be confident in any QSAR model.)
There are two distinct ways we've been generating candidates for later filtering:
This is all well and good, but we quickly simulated thousands of molecules and have been struggling to narrow them down. The process we are looking at is roughly
We'll update this over the course of the day / weekend.