stan-his / DeepFMPO

Code accompanying the paper "Deep reinforcement learning for multiparameter optimization in de novo drug design"
MIT License

Some inconsistencies between code and paper #3

Open MherMatevosyan opened 4 years ago

MherMatevosyan commented 4 years ago

I have been digging into your paper and code and noticed some potential discrepancies between the paper and the code. I would appreciate it very much if you could clarify the following points.

1) In training.py, line 84:

                for i in range(batch_mol.shape[0]):

                    # If molecule was modified
                    if not np.all(org_mols[i] == batch_mol[i]):

                        fr = evaluate_mol(batch_mol[i], e, decodings)
                        frs.append(fr)
                        rewards[i] += np.sum(fr * dist)

Shouldn't the reward be updated as rewards[i] += np.sum(fr * 1/(1+dist)), according to the paper (Eq. 12)?
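For concreteness, a minimal sketch of the two variants with made-up numbers (fr and dist here are illustrative values, not taken from a real run):

    import numpy as np

    # Hypothetical values, for illustration only.
    fr = np.array([1.0, 0.0, 1.0])    # per-property scores from evaluate_mol
    dist = np.array([0.8, 0.2, 0.5])  # per-property weights held in dist

    reward_code = np.sum(fr * dist)             # what line 84 currently computes
    reward_paper = np.sum(fr * 1 / (1 + dist))  # Eq. 12 as I read it
    print(reward_code, reward_paper)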

2) In models.py the Actor and the Critic have different learning rates (0.0005 and 0.0001), but the paper states that they both use the same learning rate (page 19). I would like to know the reason for the different learning rates, and if there are any other parameter changes I would appreciate it if you could list them.

3) In training.py, line 112:

            for i in range(BATCH_SIZE):

                a = int(actions[i])
                loss = -np.log(probs[i,a]) * td_error[i]
                target_actor[i,a] = td_error[i]

The variable loss is not used anywhere. From https://www.freecodecamp.org/news/an-intro-to-advantage-actor-critic-methods-lets-play-sonic-the-hedgehog-86d6240171d/ I understand that the policy update should include the logarithm of the action probability. So should target_actor[i,a] = loss be the correct update instead?

stan-his commented 4 years ago

1) Thank you for noticing this. This is an error in the code and I will fix it shortly. The variable dist is currently implemented as a weighted moving average of 1/(1+d), where d is the frequency of a given property. The difference from the paper is that in the paper the weighted moving average is calculated over the d variable instead. As long as the beta parameter is large and the frequencies don't vary much from one time step to the next, this difference is not noticeable.
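As a minimal sketch of that difference (synthetic frequencies and illustrative names, not taken from the repo):

    import numpy as np

    beta = 0.99  # moving-average weight; assumed to be large
    d_history = np.random.uniform(0.3, 0.4, size=1000)  # synthetic frequencies of one property

    # Code: exponentially weighted moving average of 1/(1+d), used directly as the weight.
    w_code = 1.0 / (1.0 + d_history[0])
    for d in d_history[1:]:
        w_code = beta * w_code + (1 - beta) / (1.0 + d)

    # Paper: exponentially weighted moving average of d, with 1/(1+.) applied at the end.
    d_avg = d_history[0]
    for d in d_history[1:]:
        d_avg = beta * d_avg + (1 - beta) * d
    w_paper = 1.0 / (1.0 + d_avg)

    print(w_code, w_paper)  # close to each other when beta is large and d varies little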

2) The learning rate 0.0001 was used to produce the results in the paper. When continuing to work with the code, we noticed that the same results could be achieved, with a more stable optimization process, if the critic's learning rate was increased to 0.0005.

3) The loss variable was only used for debugging and was printed in an early implementation of the algorithm. The log probability is used to find the optimal policy and is implemented in the same way as in the blog post you refer to (see line 22 in models.py).
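For anyone following along, a minimal sketch of how a loss of that form is typically written with the Keras backend (illustrative only, not the literal contents of models.py); y_true is assumed to hold the TD error at the chosen action's index and zeros elsewhere:

    from keras import backend as K

    def actor_loss(y_true, y_pred):
        # y_true: zeros except at the chosen action, where it carries the TD error
        # y_pred: the policy's action probabilities
        # Minimizing this maximizes log(pi(a|s)) * td_error, so the logarithm
        # enters through the loss function rather than the training loop.
        return -K.sum(y_true * K.log(K.clip(y_pred, 1e-10, 1.0)), axis=-1)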