mir-group / allegro

Allegro is an open-source code for building highly scalable and accurate equivariant deep learning interatomic potentials
https://www.nature.com/articles/s41467-023-36329-y
MIT License

How can I modify Training and loss forces of my system specially Li atoms in my LGPO system? #76

Open ElhamPisheh opened 8 months ago

ElhamPisheh commented 8 months ago

Dear all, I have recently used the NequIP/Allegro framework to train a potential on DFT data for my LGPO system, using 30,000 configurations. With the ASE calculator I generated ML forces and compared them with the DFT forces. Unfortunately, the force loss did not improve at all. I have tried the following: changing the cutoff from 5 to 7 and to 14; increasing the maximum number of epochs from 100 to 200; changing batch_size from 1 to 4 and to 6; using different train/validation splits (80/20 and 70/30); toggling whether the training data is shuffled; checking the mathematical expression for the overall loss and raising the force loss coefficient from 1.0 to 100; using different seeds to obtain different training and validation sets; and trying l_max = 1 and 2.

I have tried everything I can think of to improve the ML forces (loss_f, loss_e, and the total loss), especially for the Li atoms. With a force coefficient of 100, the total loss stays near 23 and loss_f near 0.23; with a force coefficient of 1, the total loss drops to about 0.23, but loss_f itself is unchanged at around 0.23. So these changes only rescale the loss, and the force errors remain large.
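For reference, here is a minimal sketch of how the force weighting is set in the YAML config, assuming the standard nequip-style loss_coeffs block (the values shown are the ones tried above, not a recommendation):

```yaml
# loss weighting in the nequip/Allegro training config; raising `forces`
# rescales the total loss but does not by itself reduce the force error
loss_coeffs:
  forces: 100.0            # force loss coefficient tried: 1.0 and 100.0
  total_energy:
    - 1.0
    - PerAtomMSELoss       # per-atom energy loss, as in the example configs
```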

Do you have any new ideas to help me improve the forces and the overall results?

| Name | Epoch | wall (hours) | LR | loss_f | loss_e | loss | f_mae | f_rmse | e_mae | e/N_mae |
|---|---|---|---|---|---|---|---|---|---|---|
| Train | 200 | 9.969141667 | 0.002 | 0.23 | 0.0368 | 23.0 | 0.213 | 0.485 | 0.867 | 0.0173 |
| Validation | 200 | 9.969141667 | 0.002 | 0.204 | 0.000193 | 20.4 | 0.200 | 0.456 | 0.699 | 0.014 |

ElhamPisheh commented 8 months ago

Any suggestions for addressing my issue?

DavidW99 commented 8 months ago

From my personal experience, keeping the batch size small, like 1 or 4, is good practice in this framework; I have seen increasing the batch size degrade performance. I would suggest keeping l_max = 2 or 3, since more angular resolution gives better accuracy.

You may then try tuning the architecture: increase num_layers (e.g., 2 or 4); increase num_tensor_features (e.g., 32 or 64), adjusting two_body_latent_mlp_latent_dimensions and latent_mlp_latent_dimensions accordingly, which provides more channels; and adjust the learning_rate (e.g., 0.005 or 0.001).
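Not authoritative, but a sketch of what those knobs look like in an Allegro YAML config, with the key names taken from the suggestions above; the values are only illustrative starting points, not tuned for LGPO:

```yaml
# architecture / optimizer settings discussed above (illustrative values only)
l_max: 2                                  # angular resolution; try 2 or 3
num_layers: 2                             # try 2 or 4
num_tensor_features: 32                   # try 32 or 64 for more channels
two_body_latent_mlp_latent_dimensions: [128, 256, 512]   # widen along with num_tensor_features
latent_mlp_latent_dimensions: [512, 512]                  # widen along with num_tensor_features
learning_rate: 0.005                      # try 0.005 or 0.001
batch_size: 4                             # keep small (1 to 4)
```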

Hope this can help!

ElhamPisheh commented 8 months ago

Dear David,

I will check all of them and let everyone know the outcome.

Thanks a lot, Best Regards, Elham

ElhamPisheh commented 8 months ago

Dear David,

I have a problem due to the wall-time limit on my system: it is 2 days, and after that the server stops our job.

For example, I could only complete 80 of 200 epochs within 2 days. Is there any way to restart the job from where it left off, i.e., to start from epoch 81?

Thanks for your time, Elham

DavidW99 commented 8 months ago

Hi Elham,

Thanks for your question! Allegro will restart from the best model saved from the previous run when you keep the same run_name in your config file. In your case, the best model should be the one saved at the 80th epoch, as I assume the loss has not plateaued yet. You can also set append: true, as in example.yaml, so the log will be appended rather than overwritten.
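For concreteness, a sketch of the relevant config lines (the root and run_name values here are placeholders; only the keys matter, and they must match the interrupted run):

```yaml
# keep root and run_name identical to the previous run so training resumes
# from the best model saved in that run's results directory
root: results/lgpo        # placeholder path
run_name: lgpo-allegro    # placeholder name; must match the interrupted run
append: true              # append to the existing log instead of starting a new one
```

Then resubmit the same training command with this config and it should pick up from the previously saved model.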

ElhamPisheh commented 8 months ago

Dear David,

Thanks for your guidance and your time.