Open whecrane opened 1 month ago
If you call `sample`, then you will get samples from the prior without considering any scaffold. If you want random molecules conditioned on a scaffold, you'd have to prepare embeddings that are reasonably close to having the scaffold and then decode those embeddings with the scaffold constraint via `decode`.
To get embeddings to decode, you could e.g. embed one molecule that has the scaffold and perturb its embeddings randomly, or you could even embed many molecules that have the scaffold and fit a mixture model to those embeddings and then sample from it. Finally, you could even take fully random embeddings and decode them with the scaffold constraint, but that may lead to low-quality results as the model may be confused if there is a large mismatch between what the embedding would decode to without the constraint vs with.
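The first two options above can be sketched with plain NumPy. This is a minimal sketch, not MoLeR's own API: the helper names `perturb_embedding` and `sample_from_fitted_gaussian` are hypothetical, the noise scale of 0.5 is an arbitrary choice, and a single diagonal Gaussian is used as a simplified stand-in for a full mixture model.

```python
import numpy as np


def perturb_embedding(embedding, num_samples, noise_std=0.5, rng=None):
    """Return `num_samples` noisy copies of one embedding (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    embedding = np.asarray(embedding)
    # One independent Gaussian noise vector per requested sample.
    noise = rng.normal(0.0, noise_std, size=(num_samples, embedding.shape[-1]))
    # Broadcast the single embedding against the noise and keep its dtype.
    return (embedding.reshape(1, -1) + noise).astype(embedding.dtype)


def sample_from_fitted_gaussian(embeddings, num_samples, rng=None):
    """Fit a diagonal Gaussian to many embeddings and sample from it.

    Simplification: a single Gaussian, not a mixture model.
    """
    rng = np.random.default_rng() if rng is None else rng
    embeddings = np.stack(embeddings)  # shape (num_molecules, dim)
    mean = embeddings.mean(axis=0)
    std = embeddings.std(axis=0)
    draws = rng.normal(size=(num_samples, embeddings.shape[1]))
    return (mean + std * draws).astype(embeddings.dtype)


# Assumed usage with a loaded model (not executed here):
#     [embedding] = model.encode(["CCCc1ccccc1"])
#     candidates = perturb_embedding(embedding, num_samples=5)
#     model.decode(candidates, scaffolds=["CCC"] * len(candidates))
```

Either helper returns an array with one row per candidate, so the list passed as `scaffolds` needs one entry per row.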
Thank you for your advice. I think perturbing the embedding randomly will work. Thanks
Hi, I followed your advice and added some noise. When I pass the `scaffolds` parameter, the decoded output is always the same. Here is my code:

    with load_model_from_directory(model_dir) as model:
        embeddings = model.encode(example_smiles)
        print(f"Embedding shape: {embeddings[0].shape}")
        noise = np.random.normal(0, 0.5, embeddings[0].shape)
        noise = noise.astype(embeddings[0].dtype)
        noise_expand = np.expand_dims(noise, axis=0)
        noise_embedding = embeddings[0] + noise_expand
        decoded = model.decode(noise_embedding, scaffolds=["CCC"])
        print(f"Decoded:{decoded}")

I would like to know how to use `scaffolds` correctly.
Best
What do you mean by "always the same": across executions of your script? I imagine the script may be deterministic, because the MoLeR code sets random seeds for various libraries such as `numpy`. When I draw several random vectors I get varying results:
>>> noise = np.random.normal(0, 0.5, (5, embeddings[0].shape[-1]))
>>> noise = noise.astype(embeddings[0].dtype)
>>> noise_embedding = embeddings[0] + noise
>>> print(noise_embedding.shape)
(5, 512)
>>> model.decode(noise_embedding, scaffolds=["CCC"] * len(noise_embedding))
['CCC(C1=CC=CC=C1)C1=CC=CC=C1', 'CC(C)C1=CC=CC=C1', 'CCCC1=CC=CC=C1', 'CCC(C1=CC=CC=C1)C1=CC=CC=C1', 'CC(C)C1=CC=CC=C1']
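The point about seeding can be illustrated in isolation. The sketch below does not use MoLeR at all: the `np.random.seed(0)` call is only a stand-in for whatever seeding a library performs, and using a separate `Generator` is one possible way, assumed here, to get noise that differs between runs.

```python
import numpy as np

# Stand-in for a library that seeds NumPy's global random state.
np.random.seed(0)
first = np.random.normal(0, 0.5, size=4)

# Re-seeding with the same value replays exactly the same draws,
# which is why a seeded script prints the same result on every run.
np.random.seed(0)
second = np.random.normal(0, 0.5, size=4)
print(np.array_equal(first, second))

# A Generator created without a seed takes fresh entropy from the OS and
# is independent of the global seed, so its noise varies between runs.
rng = np.random.default_rng()
fresh_noise = rng.normal(0, 0.5, size=(5, 4))
print(fresh_noise.shape)
```

Within a single run, drawing several noise vectors at once (as in the session above) already gives varied embeddings; the separate generator only matters if the results should also change from one execution to the next.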
Thank you very much for your explanation, it works perfectly.
Hello, thank you for the nice work. I ran MoLeR successfully using your example. However, when I call `sample` and also decode with scaffolds, the sampled molecules do not resemble my scaffold. Is it possible to generate different molecules from the same scaffold?
Best
    with load_model_from_directory(model_dir) as model:
        embeddings = model.encode(example_smiles)
        print(f"Embedding shape: {embeddings[0].shape}")
        decoded = model.decode(embeddings)
        decoded_scaffolds = model.decode(embeddings, scaffolds=["CC1C(NCC=O)=O)=O"])
        sample = model.sample(10)
        print(f"Encoded: {example_smiles}")
        print(f"Decoded with scaffolds: {decoded_scaffolds}")
        print(f"Sample:{sample}")