Competition Entry (WIP)

opensourceantibiotics / murligase

Everything to do with the Mur Ligase Project

Competition Entry (WIP) #83

Open tbrx opened 2 years ago

tbrx commented 2 years ago

Hi, a few of us at UCL CS (and further afield) have a WIP submission; I realize we're getting down to the wire and we're having a bit of a challenge finishing this off, so I wanted to share our progress and possibly reach out for assistance. We've been generating candidate molecules and are now struggling to reduce the generated set to a good small set of distinct "entries".

We've gone through a few iterations of trying to decide what a good "reward" should be, to direct the generative model. One option is docking scores.

@dehaenw has defined a pharmacophore model based on the four initial fragments, which we are also using as a screen. He can provide more details on this. At the moment, this is our primary way of narrowing down the candidates. (There were too-few fragments for us to be confident in any QSAR model.)

There are two distinct ways we've been generating candidates for later filtering:

1. I've been running the stochastic search method from this paper, with code here. It is a deep generative model for molecules trained as an autoencoder; the catch is it proposes molecules via an estimated synthetic route, so it should (hopefully) only generate compounds that are stable and easy to source. This has been run with two different objectives (both subject to LogP and molecular weight being in the required range):
   - Maximizing average Tanimoto similarity to pairs of the original four fragments: this finds a large number of molecules which are pieced together from elements of the original fragments
   - Minimizing pharmacophore RMSD: this finds a small number of novel structures, which look plausibly interesting
2. @AndreiP25 has been generating molecules using a genetic algorithm. I'll let him expand on the details, but the genetic algorithm is attempting to minimize docking scores subject to a number of constraints (on QED, SAscore, LogP).
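
As a rough illustration of the first objective above, here is a minimal sketch of an "average Tanimoto similarity to a pair of fragments" score. It is not the code used for the submission. It assumes fingerprints are represented as sets of on-bit IDs, and it assumes the objective means "the best mean similarity over all fragment pairs"; the function names and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def best_pair_similarity(candidate, fragments):
    """Highest mean similarity of the candidate to any pair of fragments.

    ASSUMPTION: this is one possible reading of "average similarity to
    pairs of fragments"; the thread does not spell out the exact formula.
    """
    return max(
        (tanimoto(candidate, f1) + tanimoto(candidate, f2)) / 2
        for f1, f2 in combinations(fragments, 2)
    )


# Toy fingerprints: sets of made-up feature IDs, not real molecules.
fragments = [
    frozenset({1, 2, 3}),
    frozenset({3, 4, 5}),
    frozenset({6, 7}),
    frozenset({1, 8}),
]
candidate = frozenset({1, 2, 3, 4, 5})

score = best_pair_similarity(candidate, fragments)
print(round(score, 3))
```

A candidate that merges features of two fragments scores well against that pair, which fits the observation that this objective finds molecules pieced together from elements of the original fragments.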

This is all well and good, but we quickly simulated thousands of molecules and have been struggling to narrow them down. The process we are looking at is roughly

1. @AndreiP25 runs a set of medchem filters, which reject molecules with known undesirable structures (this filters out things which are likely unstable or reactive, etc.)
2. We use @dehaenw's pharmacophore model to reject any with too-large RMSD
3. @an81 and @dehaenw have been docking the remaining proposals. The docking scores themselves don't seem indicative of much, so instead we're looking for those molecules for which docking and the pharmacophore model agree on the ligand pose
This still leaves us with quite a few molecules — we're looking at scanning through libraries of compounds to see which of these are already available. @dehaenw also has been doing manual (visual) inspection to narrow down to those which seem "reasonable".
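
One simple way to test whether two methods "agree on the ligand pose" is an in-place RMSD between the two sets of coordinates. The sketch below assumes both poses are in the same receptor frame, list the same atoms in the same order, and uses a 2.0 Å cutoff; the cutoff, the function names and the toy coordinates are all assumptions, and symmetry-equivalent atoms are not handled.

```python
import math


def pose_rmsd(pose_a, pose_b):
    """RMSD between two poses without superposition (same atom order assumed)."""
    if not pose_a or len(pose_a) != len(pose_b):
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def poses_agree(pose_a, pose_b, cutoff=2.0):
    """True if the two poses are within `cutoff` angstroms RMSD (assumed threshold)."""
    return pose_rmsd(pose_a, pose_b) <= cutoff


# Toy three-atom "ligand": a docked pose, a slightly shifted
# pharmacophore-embedded pose, and a pose flipped end to end.
docked = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
shifted = [(0.0, 1.0, 0.0), (1.5, 1.0, 0.0), (3.0, 1.0, 0.0)]
flipped = [(3.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 0.0, 0.0)]

print(poses_agree(docked, shifted))  # shifted by 1 angstrom
print(poses_agree(docked, flipped))  # reversed binding orientation
```

No superposition is applied on purpose: the question is whether the ligand sits in the same place in the pocket, not whether the two conformations have the same shape.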

We'll update this over the course of the day / weekend.

drc007 commented 2 years ago

@tbrx You could cluster the results and select a representative from each cluster?
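
A minimal sketch of that suggestion, assuming each molecule already has a fingerprint expressed as a set of on-bit IDs. The greedy scheme below is in the spirit of Taylor-Butina clustering but is a simplified pure-Python variant, not a library implementation; the 0.5 threshold and the toy fingerprints are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def greedy_clusters(fps, threshold=0.5):
    """Cluster by repeatedly picking the item with the most unassigned neighbours.

    Returns a list of (representative_index, member_indices) tuples.
    """
    n = len(fps)
    neighbours = {
        i: {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    }
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Ties are broken by lowest index because max() keeps the first maximum.
        centre = max(sorted(unassigned), key=lambda i: len(neighbours[i] & unassigned))
        members = (neighbours[centre] & unassigned) | {centre}
        clusters.append((centre, sorted(members)))
        unassigned -= members
    return clusters


# Toy fingerprints (made-up feature IDs): two families and one singleton.
fps = [
    {1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6},
    {10, 11, 12}, {10, 11, 13},
    {20},
]
clusters = greedy_clusters(fps)
representatives = [centre for centre, _ in clusters]
print(representatives)
```

Picking the cluster centre as the representative is only one choice; the best-scoring or most readily purchasable member of each cluster would be equally reasonable.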

dehaenw commented 2 years ago

Hey all, just to give a bit more background info on the structure-based component of this:

- There are two pharmacophores. The first is hand-constructed in MOE ("Molecular Operating Environment"), a commercial software package for SBDD. The second uses the RDKit pharm3d functionality to embed molecules onto a cruder pharmacophore (based on the 3D coordinates and feature types of the MOE pharmacophore) and scores them using RMSD.
- Docking: two main docking techniques are used, Vina (open source and standard) and MOE (proprietary). As @tbrx remarked, we don't put too much confidence in the docking scores. In fact, Vina and MOE often disagree; it seems MOE can deal with the allosteric site a bit better, and it can account for the solvent more explicitly.

Some choices needed to be made for the pharmacophore. The following features were included:

Essential:
- H-pi interaction between lysine CH and aromatic ring
- hydrogen bond donation from lysine backbone N to a heteroatom
- molecule must fit within the pocket volume

Optional:
- ionic interaction between a cation in the ligand and glutamate COO- and a nearby aspartate
- hydrogen bond donation from ligand to a water
- interaction with Cl- anion

Of course Cl- and that specific HOH are not present in all of the PDBs, so opinions may differ on how sensible it is to target those. However, the HOH amide interaction occurs in more than one fragment. Here is a screenshot of the pharmacophore straight from the MOE GUI. Main features: cyan = HBA, purple = HBD, orange = aromatic.

Any other contributors who see any flaws with this, or who have another wildly different pharmacophore hypothesis, are welcome to share any comments.

The idea is that, because there are not really enough activity data available for a ligand-based ML prediction, these two approaches serve as a filter and guide, informed by structure and human choice, to keep the output of the generator in check or steer it the right way.

dehaenw commented 2 years ago

Hi all, we managed to shrink down our large collection of compounds into a smaller set of 30 or so compounds. In general, only high-ranked molecules that were successful in both the docking screen and the pharmacophore screen were retained. Then duplicates, similar compounds and synthetically infeasible compounds were filtered out by eye (my eye). The CSV pasted below also contains the "origin" of each structure, meaning:

- VS: search on extremely large databases, so near-commercial
- ph4g: generator guided by pharmacophore RMSD from RDKit
- sim: generator guided by an objective including similarity to two of the known crystallized fragments
- GA: genetic algorithm guided by docking score and some descriptor cutoffs
- SBDD benchmark: a single hand-designed compound confirmed by docking as a reference

smiles,origin,comment
O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1,VS,
O=C(Nc1n[nH]c(CC)c1)c1c(C2C[NH2+]CC2)[nH]nc1,VS,
O=C(Nc1c(C(=O)Nc2cc(C(=O)N)[nH]c2)nccn1)C(C)(C)C,VS,
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1,VS,
O=C(Nc1c(O)cccc1)C(O)C(O)C(O)C(O)CO,VS, gluconic acid conjugate which might be a problem
O=C(NC=1C(=O)NC(C)=CC=1)C(=O)N1C(CO)CC(O)C1,VS,
Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1,VS,
O=C(Nc1cc(NC(=O)C)cc(C(=O)Nc2c(O)cc(C)cc2)c1)C,VS,
O=C(Cc1ccccc1)NCc1cccc2c1CN(C1CCC(=O)NC1=O)C2=O,ph4g,lenalidomide scaffold
COc1cc2c(cc1Nc1ncc(Cl)c(NC3CCCCC3NS(C)(=O)=O)n1)CCNC(=O)C2,ph4g,
O=C1CCC(N2Cc3c(CNC(=O)c4cccc(Cl)c4)cccc3C2=O)C(=O)N1,ph4g,lenalidomide scaffold
COC(=O)c1cc(Cl)ccc1NC(=O)COCC(=O)N1CCC(N2C(=O)CCC2=O)CC1,ph4g,
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O,ph4g, comparable with gluconic amide from VS
S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CC[NH+](C)CC2)cc1,sim,
Fc1cc(C(O)C(=O)Nc2cc(C#N)ccc2)c(NC(=O)N2CC[NH+](C)CC2)cc1,sim,
O=C(Nc1cc(C#N)ccc1)C([NH2+]Cc1cc(N2CC[NH+](C)CC2)ccc1)C,sim, double cation - investigate protonation state at 7.4
S(CC(NC(=O)Nc1ccc(NC(=O)N2CC[NH+](C)CC2)cc1)(CO)C)C,sim,strange binding mode
S(CCC(NC(=O)N1CC[NH+](C)CC1)C(=O)Nc1cc(C#N)ccc1)C,sim,
CC(CO)(CO)NC(=O)Nc1ccccc1,GA,too close to known frag?
Cc1nc2c(C#N)ccc(O)c2[nH]1,GA,
N#Cc1ccc(O)c(Nc2nccs2)c1,GA,
CNCC1=C(C(=O)Nc2ccccc2)CCC1,GA,
CNCc1ccc(-c2cccc3[nH]c(C)nc23)nn1,GA,
Cc1ccc(C)c2oc(-c3ccc(O)nn3)cc12,GA,
O=C(Nc1ccc(O)s1)c1ccccc1,GA,
CCCCC(=O)Nc1cc(CC)ccc1O,GA,
Cc1c[nH]c(=O)n1CCC(=O)c1ccccc1,GA,
O=C(Nc1cccc(NCc2ccccc2)c1)c1ccccc1,GA,
Nc1cccc(CC(=O)Nc2ccccc2)c1,GA,
O=C(NCc1ccccc1)C1CCNC1,GA,
Cc1cccc(-c2ncc(CN)o2)c1,GA,
O=C(Nc1cccc(-c2ccccc2)c1)C1CCNCC1,GA,
[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O,SBDD benchmark,

for the more visually inclined, here is the mol grid image:

We also have a bit more data per compound, so if we want to prioritize any compounds we have docking conformations etc. available, and can also calculate some descriptors and filter out anything too lipophilic, et cetera. We'd be glad to hear any comments and criticisms of our set thus far!
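
For anyone who wants to work with the table programmatically, here is a small sketch that parses it with Python's standard `csv` module, tallies entries by origin and pulls out the rows carrying a comment. Only five rows from the table above are pasted in for brevity; the variable names are illustrative.

```python
import csv
import io
from collections import Counter

# A five-row excerpt of the table above (same three columns).
csv_text = """smiles,origin,comment
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1,VS,
O=C(Cc1ccccc1)NCc1cccc2c1CN(C1CCC(=O)NC1=O)C2=O,ph4g,lenalidomide scaffold
S(CCC(NC(=O)N1CC[NH+](C)CC1)C(=O)Nc1cc(C#N)ccc1)C,sim,
CC(CO)(CO)NC(=O)Nc1ccccc1,GA,too close to known frag?
Cc1nc2c(C#N)ccc(O)c2[nH]1,GA,
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# How many candidates came from each generation method.
counts = Counter(row["origin"] for row in rows)

# Rows the author annotated, which probably deserve a second look.
flagged = [(row["smiles"], row["comment"].strip()) for row in rows if row["comment"].strip()]

print(dict(counts))
for smiles, comment in flagged:
    print(comment, "->", smiles)
```

This works without quoting because none of the SMILES strings in the table contain commas.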

finlayiainmaclean commented 2 years ago

Hi! Compounds are most likely to accumulate in E. coli if they contain a sterically unencumbered amine (such as a primary amine), are relatively rigid, and have low globularity. I filtered your top 30 hits to those that have <5 rotatable bonds and <0.1 globularity and contain a primary amine, which yielded 6 compounds:

O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1
O=C(Nc1n[nH]c(CC)c1)c1c(C2C[NH2+]CC2)[nH]nc1
O=C(Nc1c(C(=O)Nc2cc(C(=O)N)[nH]c2)nccn1)C(C)(C)C
Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1
Nc1cccc(CC(=O)Nc2ccccc2)c1 (available from Enamine BBV-34448717)
Cc1cccc(-c2ncc(CN)o2)c1 (available from Enamine BBV-92898715)

Your top hits with additional properties are here

WIP: I did check the synthetic routes for these 6 compounds in Postera, but it seems to remove the primary amine when searching, as pointed out by @drc007 in the comment below. I'm looking into why.

(I'm by no means an expert; these are the filtering criteria that I and others have used in various rounds of this challenge)

drc007 commented 2 years ago

@finlayiainmaclean None of the examples above appear to contain a primary amine?

finlayiainmaclean commented 2 years ago

@drc007 You're correct that searching the molecules (e.g. O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1) through Postera seems to remove the primary amines (resulting in O=C(Nc1nc2ccccc2[nH]1)c1cn[nH]c1C1CCNC1). I've amended the above post and will see if I can generate synthetic routes without this change in structure.

dehaenw commented 2 years ago

Hi all, thank you for the synthetic routes and remarks on "rules of thumb" around accumulation in E. coli. I read up using these references [0][1]. Primary amines certainly seem like a good strategy to increase accumulation, though as can be seen in [0], a working antibiotic structure is often necessary before modification to increase accumulation is considered. Some more interesting results are in [1], where they suggest a positive charge might be the most important factor. They also mention globularity and flexibility, so I assume this is the paper @finlayiainmaclean is referencing. Fortunately, most of the compounds upthread will have a charged secondary amine at pH 7.4, so the most important factor is there. However, they also note: "Even conversion of the primary amine to an amine with more substitutions had a deleterious effect on accumulation." In Fig. 2 of the paper they show a few series of compounds and compare primary, secondary, tertiary and quaternary amines directly, which is pretty interesting, and shows a marked but not complete decrease. They use a simple random forest model based on descriptors to model E. coli accumulation; it might be interesting to reproduce this so people can use it here to get some more insights into their compounds or prioritize candidates. The model is available at https://github.com/HergenrotherLab/GramNegAccum, but it relies on proprietary software to generate conformations, which are necessary for some of the descriptors.

[0] 10.1021/acsinfecdis.0c00715 [1] 10.1038/nature22308

dehaenw commented 2 years ago

> @drc007 You're correct that searching the molecules (e.g. O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1) through Postera seems to remove the primary amines (resulting in O=C(Nc1nc2ccccc2[nH]1)c1cn[nH]c1C1CCNC1). I've amended the above post and will see if I can generate synthetic routes without this change in structure.

Also, what happened here is just deprotonation of the (secondary) amine. Interconversion between these is pretty easy (and the major form will be protonated at pH 7.4 anyway), so it is sensible to look for routes including the freebase amine. But as @drc007 remarks, most of the molecules are secondary amines. The only primary amine in our set is [NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O. I think the SMARTS query you are using for primary amines is probably faulty because it does not account for charge. In case you're doing this in RDKit, here are SMARTS patterns for primary amines, charged and uncharged:

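from rdkit import Chem

# Compile each SMARTS once, outside the loop; both exclude amide nitrogens.
patt = Chem.MolFromSmarts("[NX3;!H3;H2,!H1;!H0;!$(NC=O)]")  # uncharged primary amine
patt2 = Chem.MolFromSmarts("[NX4+;!H4;H3,!H2;!H1;!H0;!$(NC=O)]")  # charged primary amine

smis = ["[NH4+]","[NH3+]C","[NH2+](C)C","[NH+](C)(C)C","[N+](C)(C)(C)C","N","NC","N(C)C","N(C)(C)C"]
for smi in smis:
    mol = Chem.MolFromSmiles(smi)
    # True if the molecule contains a primary amine in either protonation state
    print(mol.HasSubstructMatch(patt2) or mol.HasSubstructMatch(patt))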

This should print False, True, False, False, False, False, True, False, False (one value per line), because only [NH3+]C and NC are primary amines.
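
The remark above that the protonated form dominates at pH 7.4 can be checked with the Henderson-Hasselbalch relation, where the protonated fraction of a base is 1 / (1 + 10^(pH - pKa)). The sketch below uses approximate textbook pKa values for the unsubstituted parent amines (about 11.3 for pyrrolidine, about 4.6 for aniline); these are assumptions for illustration, and the actual pKa of each candidate will be shifted by its substituents.

```python
def fraction_protonated(pka, ph=7.4):
    """Fraction of a monoprotic base present as its conjugate acid at the given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# ASSUMPTION: approximate conjugate-acid pKa values of the parent amines,
# not measured or predicted values for the candidates in this thread.
frac_pyrrolidine = fraction_protonated(11.3)  # aliphatic secondary amine
frac_aniline = fraction_protonated(4.6)       # aromatic primary amine

print(f"pyrrolidine-like amine: {frac_pyrrolidine:.4f} protonated")
print(f"aniline-like amine:     {frac_aniline:.4f} protonated")
```

Under these assumptions an aliphatic amine is almost entirely charged at physiological pH, while an aniline-type nitrogen is almost entirely neutral, which matters when deciding which "primary amines" actually carry the positive charge discussed above.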

mattodd commented 2 years ago

Very interesting @dehaenw @drc007 @finlayiainmaclean. I'd say that accumulation in E. coli via implementation of these EntryWay criteria is a "nice to have". At this stage we're interested in binders, and can engineer in some accumulation biases later. Not a problem to include now, but not necessary.

@dehaenw shall we just take a look at the top 30 you've shown above? It'd be very useful to know which can just be bought, if anyone has a quick way of sorting the molecules in that way? Otherwise @edwintse will try manually, and @danielgedder can help with considering simplest synthetic routes?

From the description of the methodology above, based on the fragment-defined pharmacophore, I'm interested to see what happens here, experimentally.

dehaenw commented 2 years ago

I would definitely just consider these 30 compounds for now. A few of these, but not all, are part of make on demand libraries. Searching on chem-space.com using the unprotonated amines:

O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2CNCC2)[nH]nc1
O=C(Nc1n[nH]c(CC)c1)c1c(C2CNCC2)[nH]nc1
O=C(Nc1c(C(=O)Nc2cc(C(=O)N)[nH]c2)nccn1)C(C)(C)C
O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1
O=C(Nc1c(O)cccc1)C(O)C(O)C(O)C(O)CO
O=C(NC=1C(=O)NC(C)=CC=1)C(=O)N1C(CO)CC(O)C1
Clc1cc(NC(=O)c2c(C3CNCC3)[nH]nc2)c(O)cc1
O=C(Nc1cc(NC(=O)C)cc(C(=O)Nc2c(O)cc(C)cc2)c1)C
O=C(Cc1ccccc1)NCc1cccc2c1CN(C1CCC(=O)NC1=O)C2=O
COc1cc2c(cc1Nc1ncc(Cl)c(NC3CCCCC3NS(C)(=O)=O)n1)CCNC(=O)C2
O=C1CCC(N2Cc3c(CNC(=O)c4cccc(Cl)c4)cccc3C2=O)C(=O)N1
COC(=O)c1cc(Cl)ccc1NC(=O)COCC(=O)N1CCC(N2C(=O)CCC2=O)CC1
O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O
S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CCN(C)CC2)cc1
Fc1cc(C(O)C(=O)Nc2cc(C#N)ccc2)c(NC(=O)N2CCN(C)CC2)cc1
O=C(Nc1cc(C#N)ccc1)C(NCc1cc(N2CCN(C)CC2)ccc1)C
S(CC(NC(=O)Nc1ccc(NC(=O)N2CCN(C)CC2)cc1)(CO)C)C
S(CCC(NC(=O)N1CCN(C)CC1)C(=O)Nc1cc(C#N)ccc1)C
CC(CO)(CO)NC(=O)Nc1ccccc1
Cc1nc2c(C#N)ccc(O)c2[nH]1
N#Cc1ccc(O)c(Nc2nccs2)c1
CNCC1=C(C(=O)Nc2ccccc2)CCC1
CNCc1ccc(-c2cccc3[nH]c(C)nc23)nn1
Cc1ccc(C)c2oc(-c3ccc(O)nn3)cc12
O=C(Nc1ccc(O)s1)c1ccccc1
CCCCC(=O)Nc1cc(CC)ccc1O
Cc1c[nH]c(=O)n1CCC(=O)c1ccccc1
O=C(Nc1cccc(NCc2ccccc2)c1)c1ccccc1
Nc1cccc(CC(=O)Nc2ccccc2)c1
O=C(NCc1ccccc1)C1CCNC1
Cc1cccc(-c2ncc(CN)o2)c1
O=C(Nc1cccc(-c2ccccc2)c1)C1CCNCC1
NC[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O

yields the attached output (cleaned up by me to remove duplicates), showing that about half of the compounds are in ChemSpace's catalog (which includes the catalogs, or subsets of them, of Enamine, UORSY, etc.).

chemspace-search-20220621110236.csv

Searching ZINC, I get the following output (the "source line" field corresponds to the compound number; it includes duplicates):

CL4G8Dge.csv

This is from ZINC, so I have not checked which of the compounds are actually in stock.
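
For reference, the unprotonated SMILES used for the searches above can be reproduced from the charged entries in the CSV by a plain text substitution of the `[NH+]`, `[NH2+]` and `[NH3+]` atom tokens. This is a minimal sketch that is only safe for simple protonated amines like the ones in this set; the function name is illustrative, and a cheminformatics toolkit is the better choice for general charge neutralization.

```python
import re

# Matches [NH+], [NH2+] and [NH3+]; leaves aromatic [nH], [NH4+] and [N+] untouched.
_PROTONATED_AMINE = re.compile(r"\[NH[23]?\+\]")


def neutralize_amines(smiles):
    """Replace protonated amine tokens with a neutral N (implicit hydrogens)."""
    return _PROTONATED_AMINE.sub("N", smiles)


charged = [
    "O=C(Nc1[nH]c2c(n1)cccc2)c1c(C2C[NH2+]CC2)[nH]nc1",
    "S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CC[NH+](C)CC2)cc1",
    "[NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O",
]
neutral = [neutralize_amines(s) for s in charged]
for smi in neutral:
    print(smi)
```

The substitution works here because SMILES infers the hydrogen count of an unbracketed N from its valence, so dropping the brackets and the charge gives the free base directly.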

drc007 commented 2 years ago

Here is an SDF file containing all structures, availability from vendors, PubChem/ChEMBL IDs where relevant and a quick patent search. OSAmolecules.sdf.zip

mattodd commented 2 years ago

That's great work @dehaenw and @drc007. @edwintse @danielgedder - what do you think? Any easy purchases and easy syntheses? We've about a week to order compounds in... Are there some quick wins here, particularly if we can order in compounds that are representative of any clusters that are structurally similar? @dehaenw I'm assuming there are no other criteria for ranking these?

edwintse commented 2 years ago

@mattodd Here's a summary after a quick SciFinder search for the 30 compounds:

Orange compounds are not purchasable but have structurally similar ones that are
Green compounds are directly purchasable
Blue compounds have reported syntheses on SciFinder (patents)
Everything else didn't show any suppliers but some look easy enough to make if needed (e.g. some simple amides)

Mur Ligase Entry 5 Brooks Paige

dehaenw commented 2 years ago

Thank you @edwintse for the overview. It's good to see there are a few commercially available.

@mattodd, I think for prioritizing there are some logical compounds to pick, because they would give the most information from the pharmacophore POV: pyrrolidinopyrazoles occur 5 times, so it would probably be worthwhile to test one from this group.

Regarding this class: of the two orange compounds above (from @edwintse's image) which belong to it, the upper one's substitution is probably better, because the basic amine of the pyrrolidine is important for predicted binding (it makes an H-bond to a backbone carbonyl oxygen of Gly309). Pyrrolidine to piperidine is probably an OK substitution. Unfortunately, the addition of that methyl group on the benzimidazole could kill activity for steric reasons; as you can see in the pic below, there is not really room for it (PDB fragment in purple, originally proposed molecule without methyl on benzimidazole in green):

Out of the molecules of this type, probably the most useful one to aim for is Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1, or even the 3-cyanophenyl derivative, for ideal comparison with the known fragments. On Enamine I can find Z2606031307 as the closest analog, and Z2377588160 as another interesting one. This is probably not relevant currently, but it seems a lot of derivatives of these are available in the REAL catalog.

The third orange compound: removing the dimethylpropionyl should be an OK substitution. The fourth: removing the methyls is completely fine.

The commercial substances in green look good. Those are pretty diverse sensible picks.

Then regarding the remaining black compounds: which ones would give the most information, and which are the more likely hitters? My priority would be approximately in this order:

1. [NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O because it is fairly close to the fragments and has a nice docking pose
2. O=C(N(CC(O)C)CCO)Nc1[nH]nc(OC)c1 to explore whether this heterocycle is a good replacement for the benzene ring in the fragments (substituting the more accessible bis-hydroxyethyl substance, and changing the ring substituent to ethyl, cyano or a halogen, should all be fine)
3. N#Cc1ccc(O)c(Nc2nccs2)c1 to see if the thiazole N can accept the H-bond from the Lys backbone N
4. O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O to see if having a sugar-like group in this region of the pocket is sensible at all
5. O=C(NC=1C(=O)NC(C)=CC=1)C(=O)N1C(CO)CC(O)C1 because it would be interesting if this oxalamide was active

dehaenw commented 2 years ago

one more thing. just noticed that the entry CC(CO)(CO)NC(=O)Nc1ccccc1,GA,too close to known frag?, the second green compound, is actually not close, but identical to frag 374. Sorry! (though it is good to see it still shows up within the hits.)

mattodd commented 2 years ago

OK @edwintse want to update the orange/greens based on @dehaenw suggestions? Looks like we can quickly order some here. Then for the 5 suggested "makes", are they ca one-steppers?

tbrx commented 2 years ago

From my perspective it would be nice if we could try to include some from each of the "different" generation methods (i.e. "VS","ph4g","sim", and "GA" in @dehaenw 's post up above). I sort of lost track of which molecules are which in the Orange / Green / Blue classification — is it the case that all of them are represented as orderable, or easy makes?

If the easily obtainable ones are all from "VS" (rather than from the other three) would be slightly disappointing…

dehaenw commented 2 years ago

I agree, it would be best to have one (or more) of each category, from our POV, because this would allow us at least some insight into which of the strategies is most promising. Making use of the info above, I would give the following "top 2" for each category:

VS:
- O=C(Nc1cc(NC(=O)C)cc(C(=O)Nc2c(O)cc(C)cc2)c1)C (commercially available)
- Clc1cc(NC(=O)c2c(C3C[NH2+]CC3)[nH]nc2)c(O)cc1 ("privileged scaffold")

ph4g:
- O=C(Nc1ccc(Br)cc1)NC1C(O)OC(CO)C(O)C1O
- O=C1CCC(N2Cc3c(CNC(=O)c4cccc(Cl)c4)cccc3C2=O)C(=O)N1 (patent structure)

sim:
- S(CCC(NC(=O)N1CC[NH+](C)CC1)C(=O)Nc1cc(C#N)ccc1)C
- S(=O)(=O)(NC(C(=O)Nc1cc(C#N)ccc1)C)c1ccc(N2CC[NH+](C)CC2)cc1

GA:
- Cc1cccc(-c2ncc(CN)o2)c1 (commercially available)
- O=C(NCc1ccccc1)C1CCNC1 (commercially available)

SBDD benchmark:
- [NH3+]C[C@@H](O)CN(C)C(=O)Nc1cc(C#N)ccc1O