The computation of scores for molecules is taking long time

545487677 commented 4 months ago

Hi, great work on implementing this method! The logic and structure are very clear and concise. However, I have run into significant performance issues with the compute_score method in the code: computing scores for molecules takes a long time.

    @classmethod
    def compute_score(cls, molecules: tuple[str], scoring_fn: Callable[[str], float]) -> float:
        """Computes the score of the molecules.

        :param molecules: A tuple of SMILES. The first element is the currently constructed molecule
                          while the remaining elements are the building blocks that are about to be added.
        :param scoring_fn: A function that takes as input a SMILES representing a molecule and returns a score.
        :return: The score of the molecules.
        """
        return sum(scoring_fn(molecule) for molecule in molecules) / len(molecules) if len(molecules) > 0 else 0.0

    @cached_property
    def P(self) -> float:
        """The property score of this Node. (Note: The value is cached, so it assumes the Node is immutable.)"""
        return self.compute_score(molecules=self.molecules, scoring_fn=self.scoring_fn)

Problem: The execution time for the compute_score method is very long, especially when calculating scores for a large number of molecules.

Questions:

1.  Is this part of the code being executed in a multithreaded manner?
2.  How many CPU cores are being utilized when running this method?
3.  Are there any recommended optimizations to reduce the computation time for the compute_score method?

swansonk14 commented 4 months ago

Thank you for raising the issue! The compute_score function is generally one of the most time-consuming parts of the generative process since it's the part that actually involves running a machine learning model to predict the score of the proposed molecule. This can involve both computing molecular features (RDKit fingerprints) and running the ML model (either a random forest or a Chemprop graph neural network). On a per molecule basis, it should only take maybe 1-2 seconds, but given the large number of molecules that are proposed and scored, the time does add up.

To answer your questions:

1.  The code I wrote is always single threaded, but I believe that when the Chemprop GNN model is run, PyTorch automatically uses multiple CPUs under the hood.
2.  Everything in SyntheMol is single-threaded so just one CPU core is used for basically everything, but again, when the Chemprop GNN is scoring a molecule, I believe PyTorch will use as many CPUs as are available.
3.  The main recommendation would be to use a faster model to predict the molecular property that you're interested in. For instance, computing RDKit fingerprints is actually surprisingly slow, so using a model that doesn't rely on those fingerprints (e.g., a pure Chemprop model rather than a Chemprop-RDKit model or random forest model) will make it faster. Replacing a Chemprop GNN with a multilayer perceptron on easy-to-compute fingerprints like Morgan fingerprints could be even faster. I try to do as much caching of molecule scores as possible so there shouldn't be any extraneous computation, so you'll just have to optimize the predictor for speed as much as possible.
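For readers plugging in their own scoring function, here is a minimal sketch of the kind of score caching mentioned above. It is not SyntheMol's actual caching code; `make_cached_scorer`, `mean_score`, and `toy_scoring_fn` are hypothetical names used only for illustration, and the toy scorer stands in for a real model. It only shows that wrapping a `scoring_fn` with `functools.lru_cache` means each distinct SMILES string is scored at most once.

```python
from functools import lru_cache
from typing import Callable, Optional


def make_cached_scorer(
    scoring_fn: Callable[[str], float], maxsize: Optional[int] = None
) -> Callable[[str], float]:
    """Wrap a scoring function so each distinct SMILES is scored at most once."""

    @lru_cache(maxsize=maxsize)
    def cached(smiles: str) -> float:
        return scoring_fn(smiles)

    return cached


def mean_score(molecules: tuple, scoring_fn: Callable[[str], float]) -> float:
    """Average score over the molecules, or 0.0 for an empty tuple."""
    if len(molecules) == 0:
        return 0.0
    return sum(scoring_fn(molecule) for molecule in molecules) / len(molecules)


call_count = 0


def toy_scoring_fn(smiles: str) -> float:
    """Hypothetical stand-in for a slow model: counts calls, returns the string length."""
    global call_count
    call_count += 1
    return float(len(smiles))


cached_fn = make_cached_scorer(toy_scoring_fn)
first = mean_score(("CCO", "CC"), cached_fn)    # scores "CCO" and "CC"
second = mean_score(("CCO", "CCN"), cached_fn)  # "CCO" comes from the cache
print(first, second, call_count)
```

If SyntheMol already caches scores, as described above, a wrapper like this adds little; it matters mainly when a custom scoring function is called outside that cache.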

Out of curiosity, what model type are you using and how long is the compute_score method taking? I also want to make sure it's not taking an unreasonably long time due to a different issue.
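One way to answer the timing question is to measure the scoring function on a handful of SMILES strings before starting a long run. The sketch below is an assumption about how one might do that, not part of SyntheMol: `time_scoring_fn` is a hypothetical helper, and the lambda is a placeholder for the real predictor.

```python
import time
from typing import Callable, Dict, Sequence


def time_scoring_fn(
    scoring_fn: Callable[[str], float], smiles_list: Sequence[str]
) -> Dict[str, float]:
    """Time each call to scoring_fn and return simple per-molecule statistics."""
    durations = []
    for smiles in smiles_list:
        start = time.perf_counter()
        scoring_fn(smiles)
        durations.append(time.perf_counter() - start)

    total = sum(durations)
    return {
        "count": len(durations),
        "total_seconds": total,
        "mean_seconds": total / len(durations) if durations else 0.0,
        "max_seconds": max(durations) if durations else 0.0,
    }


# Placeholder scorer; replace with the real model's scoring function.
stats = time_scoring_fn(lambda smiles: float(len(smiles)), ["CCO", "c1ccccc1", "CC(=O)O"])
print(stats)
```

A mean far above the 1-2 seconds per molecule quoted earlier would suggest the slowdown comes from something other than the sheer number of molecules scored.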

545487677 commented 4 months ago

Thank you for the information! I used Chemprop, and generating molecules took around 10 hours.

swansonk14 commented 4 months ago

Okay, that sounds about right in terms of timing. Please let me know if you have any other questions!

swansonk14 / SyntheMol

The computation of scores for molecules is taking long time #18