tummfm / difftre

Learning neural network potentials from experimental data via Differentiable Trajectory Reweighting
Apache License 2.0

Reporting NAN when training CG_water model #7

Open XinjianOUYANG opened 4 months ago

XinjianOUYANG commented 4 months ago

Hi, I am currently trying to reproduce your results by running the CG_water.ipynb notebook, and it raised a NaN error during training, as follows:

Step 0 in 90.34 sec Loss = 0.20526731
Step 1 in 854.44 sec Loss = 0.19676176
Step 2 in 854.85 sec Loss = 0.18241243
Step 3 in 858.79 sec Loss = 0.12344666
Step 4 in 856.21 sec Loss = 0.09613695
Step 5 in 880.87 sec Loss = nan

Have you ever run into this issue, and could you give some advice on how to solve it? Thanks a lot!

S-Thaler commented 3 months ago

Hi, thanks for your interest in DiffTRe. I did not experience NaNs with the hyperparameters provided in the notebook. Usually, NaN values originate from simulations that sample unphysical regions, most often when two atoms overlap. You can check this by computing the pairwise distances for each frame of the trajectory that becomes NaN (see the sketch below). If atoms start to overlap, you could increase the strength of the prior potential to prevent this; that is the first thing I would try. Another standard remedy is to reduce the learning rate: I increased it as much as I could so training converges in the fewest updates and saves compute, so a smaller learning rate is usually a bit safer. The reweighting also introduces some numerical noise, so performing it in float64 is often a good measure.
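A minimal sketch of such an overlap check, assuming the sampled trajectory is available as an array of shape (n_frames, n_particles, 3) in a cubic periodic box of side `box_length` (both names are placeholders, not taken from the notebook):

```python
import jax.numpy as jnp
from jax import vmap, random

def min_pair_distance(frame, box_length):
    """Smallest pairwise distance in one frame, using the minimum-image convention."""
    dr = frame[:, None, :] - frame[None, :, :]       # (N, N, 3) displacement vectors
    dr -= box_length * jnp.round(dr / box_length)    # minimum image for a cubic box
    dist = jnp.sqrt(jnp.sum(dr**2, axis=-1))
    # mask self-distances on the diagonal before taking the minimum
    dist = jnp.where(jnp.eye(frame.shape[0], dtype=bool), jnp.inf, dist)
    return jnp.min(dist)

# Placeholder data: replace with the sampled CG water trajectory that goes NaN.
box_length = 3.0
trajectory = random.uniform(random.PRNGKey(0), (10, 100, 3)) * box_length

min_dists = vmap(min_pair_distance, in_axes=(0, None))(trajectory, box_length)
print(min_dists)  # values close to zero indicate overlapping atoms
```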
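For the float64 suggestion: JAX computes in float32 by default, and one way to switch to double precision globally is the x64 flag, which must be set before the first JAX operation:

```python
from jax import config
config.update('jax_enable_x64', True)  # enable float64 before any other JAX calls
```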