microsoft / molecule-generation

Implementation of MoLeR: a generative model of molecular graphs which supports scaffold-constrained generation
MIT License

Is there a way to generate different molecules from the same scaffold? #74

Open whecrane opened 1 month ago

whecrane commented 1 month ago

Hello, thank you for your nice work. I ran MoLeR successfully using your example. However, when I call sample and also decode with scaffolds, the sampled molecules do not resemble my scaffold. Is there a way to generate different molecules from the same scaffold?

Best

with load_model_from_directory(model_dir) as model:
    embeddings = model.encode(example_smiles)
    print(f"Embedding shape: {embeddings[0].shape}")

    decoded = model.decode(embeddings)
    decoded_scaffolds = model.decode(embeddings, scaffolds=["CC1C(NCC=O)=O)=O"])
    sample = model.sample(10)

    print(f"Encoded: {example_smiles}")
    print(f"Decoded with scaffolds: {decoded_scaffolds}")
    print(f"Sample: {sample}")

kmaziarz commented 1 month ago

If you call sample, then you will get samples from the prior without considering any scaffold. If you want random molecules conditioned on a scaffold, you'd have to prepare embeddings that are reasonably close to having the scaffold and then decode those embeddings with the scaffold constraint via decode.

To get embeddings to decode, you could e.g. embed one molecule that has the scaffold and perturb its embeddings randomly, or you could even embed many molecules that have the scaffold and fit a mixture model to those embeddings and then sample from it. Finally, you could even take fully random embeddings and decode them with the scaffold constraint, but that may lead to low-quality results as the model may be confused if there is a large mismatch between what the embedding would decode to without the constraint vs with.
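The first two strategies above can be sketched with NumPy alone. This is a minimal illustration, not part of MoLeR: the helper names `perturb_embedding` and `sample_from_fitted_gaussian` and the default noise scale are assumptions, and the second helper fits a single diagonal Gaussian as a simplified stand-in for a full mixture model.

```python
import numpy as np


def perturb_embedding(embedding, num_samples, scale=0.5, rng=None):
    """Return noisy copies of ONE embedding, shape (num_samples, dim).

    Hypothetical helper: `scale` is an arbitrary noise level to tune by hand.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Accept a single embedding given as (dim,) or (1, dim).
    embedding = np.asarray(embedding).reshape(-1)
    noise = rng.normal(0.0, scale, size=(num_samples, embedding.shape[0]))
    return (embedding + noise).astype(embedding.dtype)


def sample_from_fitted_gaussian(embeddings, num_samples, rng=None):
    """Fit a diagonal Gaussian to many embeddings and draw new ones from it.

    Hypothetical helper: a one-component, diagonal-covariance simplification
    of the mixture-model idea.
    """
    rng = np.random.default_rng() if rng is None else rng
    embeddings = np.asarray(embeddings)  # shape (num_molecules, dim)
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0)
    samples = rng.normal(mean, std, size=(num_samples, embeddings.shape[1]))
    return samples.astype(embeddings.dtype)
```

Either result could then be passed to `model.decode` together with one scaffold per row, for example `scaffolds=[scaffold] * len(candidates)`, where `candidates` is the array returned by one of the helpers.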

whecrane commented 1 month ago

Thank you for your advice. I think perturbing the embedding randomly will work. Thanks!

whecrane commented 1 month ago

Hi, I followed your advice and added some noise. When I pass the 'scaffolds' parameter, the decoded molecule is always the same. Here is my code:

with load_model_from_directory(model_dir) as model:
    embeddings = model.encode(example_smiles)
    print(f"Embedding shape: {embeddings[0].shape}")

    noise = np.random.normal(0, 0.5, embeddings[0].shape)
    noise = noise.astype(embeddings[0].dtype)
    noise_expand = np.expand_dims(noise, axis=0)
    noise_embedding = embeddings[0] + noise_expand

    decoded = model.decode(noise_embedding, scaffolds=["CCC"])
    print(f"Decoded: {decoded}")

How should I use the scaffolds parameter correctly?

Best

kmaziarz commented 1 month ago

What do you mean by "always the same"? The same between executions of your script? I imagine the script may be deterministic because the MoLeR code sets random seeds for various libraries like numpy. When I draw several random vectors, I get varying results:

>>> noise = np.random.normal(0, 0.5, (5, embeddings[0].shape[-1]))
>>> noise = noise.astype(embeddings[0].dtype)
>>> noise_embedding = embeddings[0] + noise
>>> print(noise_embedding.shape)
(5, 512)
>>> model.decode(noise_embedding, scaffolds=["CCC"] * len(noise_embedding))
['CCC(C1=CC=CC=C1)C1=CC=CC=C1', 'CC(C)C1=CC=CC=C1', 'CCCC1=CC=CC=C1', 'CCC(C1=CC=CC=C1)C1=CC=CC=C1', 'CC(C)C1=CC=CC=C1']
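If the repetition comes from NumPy's global random state being seeded, one possible workaround is to draw the noise from an independent generator. The sketch below uses only NumPy and no MoLeR calls; `EMBEDDING_DIM`, `global_noise`, and `independent_noise` are illustrative names, and the explicit `np.random.seed(0)` calls merely simulate a library seeding NumPy at the start of each run.

```python
import numpy as np

EMBEDDING_DIM = 512  # matches the (5, 512) shape printed above


def global_noise(num_samples, scale=0.5):
    # Uses NumPy's global random state, so it repeats whenever that state is re-seeded.
    return np.random.normal(0.0, scale, (num_samples, EMBEDDING_DIM))


def independent_noise(num_samples, scale=0.5):
    # A fresh Generator is seeded from OS entropy and ignores np.random.seed.
    rng = np.random.default_rng()
    return rng.normal(0.0, scale, (num_samples, EMBEDDING_DIM))


np.random.seed(0)  # stand-in for a library seeding NumPy
first_global = global_noise(5)
first_independent = independent_noise(5)

np.random.seed(0)  # stand-in for the same seeding on a second run
second_global = global_noise(5)
second_independent = independent_noise(5)

print(np.array_equal(first_global, second_global))            # True: same seed, same noise
print(np.array_equal(first_independent, second_independent))  # False: fresh entropy each time
```

Noise from `independent_noise` can be added to an embedding exactly as in the session above, so repeated runs of a script would decode different perturbed embeddings.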

whecrane commented 1 month ago

Thank you very much for your explanation, it works perfectly.