possible redundancy in _next_step_and_evaluate

jrwnter / mso

Implementation of the method proposed in the paper "Efficient Multi-Objective Molecular Optimization in a Continuous Latent Space" by Robin Winter, Floriane Montanari, Andreas Steffen, Hans Briem, Frank Noé and Djork-Arné Clevert

MIT License


possible redundancy in _next_step_and_evaluate #2

Closed pcko1 closed 5 years ago

pcko1 commented 5 years ago

Is there a reason for the cyclic conversion swarm.x -> swarm.smiles -> swarm.x, rather than updating swarm.x directly? In other words, is line 70 really necessary?

https://github.com/jrwnter/mso/blob/992b46dcb4f7f4ae9027489a8ee46cd0a928c4fc/mso/optimizer.py#L68-L71

jrwnter commented 5 years ago

This step places the particle at the center of a molecule (SMILES) embedding. This means a particle can only jump from one molecule to another during optimization. If a particle's position update does not result in a point in space corresponding to a new molecule, it basically gets reset to the previous point. I included this cyclic step because I found that in some cases the optimizer was able to "optimize" a scoring function (e.g. a QSAR model) by shifting the position without actually changing the molecule...
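The re-centering described above can be sketched with a toy one-dimensional latent space. This is only an illustration under stated assumptions: `CENTERS`, `decode`, `encode` and `recenter` are hypothetical stand-ins, not the mso or CDDD API, and "decoding" is reduced to picking the nearest known molecule.

```python
# Toy stand-ins for a decoder/encoder pair (hypothetical, NOT the mso/CDDD API).
# Each "molecule" owns the part of the 1-D latent line closest to its center.
CENTERS = {"CCO": 0.0, "CCN": 1.0, "CCC": 2.0}


def decode(x):
    """Map a latent position to the SMILES whose center is nearest."""
    return min(CENTERS, key=lambda smiles: abs(CENTERS[smiles] - x))


def encode(smiles):
    """Map a SMILES string back to the center of its latent region."""
    return CENTERS[smiles]


def recenter(positions):
    """Cyclic step x -> SMILES -> x: snap every particle to a molecule center."""
    return [encode(decode(x)) for x in positions]


# A particle sitting on "CCO" (x = 0.0) takes a small and a large step.
small_step = recenter([0.0 + 0.3])  # still decodes to "CCO", so it is reset
large_step = recenter([0.0 + 0.7])  # decodes to "CCN", so it jumps there
```

In this sketch the small step lands back on the center of "CCO", while the large step moves the particle to the center of "CCN", so a particle can only ever sit on a molecule center.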

pcko1 commented 5 years ago

Hmm, by "corresponding to a new molecule" you mean a molecule different to the previous one or simply any valid molecule?

jrwnter commented 5 years ago

What I mean by this is that the position in the CDDD space will change (slightly) but it will decode back to the same SMILES. Thus, it's still a valid molecule (the penalty will only cover invalid SMILES). I just want to avoid optimizing the scoring function from, say, 0.2 to 0.8 while not changing the actual molecule (only its latent representation). Unfortunately, this can actually happen if a molecule corresponds to a larger region in the CDDD space and the scoring function (e.g. a QSAR model that takes points in this space as input) is not well defined in this region... This is particularly problematic if you include harsh substructure constraints and most of the molecules in the neighbourhood get penalized...
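A minimal sketch of that failure mode, again in a toy one-dimensional latent space. Everything here is an assumption made for illustration: `decode` picks the nearest of two made-up molecule centers, and `latent_score` is a hypothetical stand-in for a QSAR model that takes latent points as input.

```python
# Hypothetical toy setup (NOT the mso/CDDD API): two molecules on a 1-D line.
CENTERS = {"CCO": 0.0, "CCN": 1.0}


def decode(x):
    """Map a latent position to the SMILES whose center is nearest."""
    return min(CENTERS, key=lambda smiles: abs(CENTERS[smiles] - x))


def latent_score(x):
    """Stand-in for a model scoring latent points; it peaks at x = 0.4."""
    return max(0.0, 1.0 - 2.0 * abs(x - 0.4))


start, drifted = 0.0, 0.3

# The score rises from roughly 0.2 to roughly 0.8 ...
score_before = latent_score(start)
score_after = latent_score(drifted)

# ... although both positions decode to the same molecule.
same_molecule = decode(start) == decode(drifted)
```

Without the re-centering step, an optimizer would happily accept the drift from `start` to `drifted`, since only the latent representation changed and not the molecule.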

jrwnter commented 5 years ago

> Hmm, by "corresponding to a new molecule" you mean a molecule different to the previous one or simply any valid molecule?

a molecule different to the previous one

pcko1 commented 5 years ago

> Unfortunately, this can actually happen if a molecule corresponds to a larger region in the CDDD space and the scoring function (e.g. a QSAR model that takes points in this space as input) is not well defined in this region...

Ah, now I understand. So you have trained your QSAR model on CDDD points and that model is very sensitive to the location of the particles, meaning that even if two locations correspond to the same underlying molecule, the QSAR model will give different scores! So, to my understanding, if a QSAR model (trained on CDDD points) is not used in the cost function, this step can be omitted, right?

jrwnter commented 5 years ago

Yeah... that should be true. Which is nice, because this step is computationally expensive... A flag for turning this on/off would be nice then...
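One way such a switch might look, as a rough sketch only. The function name, its signature and the `recenter` flag are hypothetical and do not describe the actual mso code; `decode` and `encode` are passed in as callables so that the expensive round trip can be skipped entirely.

```python
def next_step_and_evaluate(positions, velocities, score_fn, decode, encode,
                           recenter=True):
    """Move particles, optionally snap them to molecule centers, then score.

    All names here are illustrative assumptions, not the mso API.
    """
    moved = [x + v for x, v in zip(positions, velocities)]
    if recenter:
        # Expensive round trip x -> molecule -> x; only needed when the
        # scoring function takes latent points as input.
        moved = [encode(decode(x)) for x in moved]
    scores = [score_fn(x) for x in moved]
    return moved, scores


# Toy usage: "decoding" rounds to the nearest integer, the score is x itself.
snapped, snapped_scores = next_step_and_evaluate(
    [0.0], [0.3], lambda x: x, decode=round, encode=float, recenter=True)
free, free_scores = next_step_and_evaluate(
    [0.0], [0.3], lambda x: x, decode=round, encode=float, recenter=False)
```

With `recenter=True` the small step is undone before scoring; with `recenter=False` the particle keeps its raw latent position, which is only safe when the scoring function depends on the decoded molecule rather than on the latent point.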