torchmd / torchmd-net

Training neural network potentials
MIT License

Loss Function vs epoch plot while training #338

Open Awesomium10 opened 3 months ago

Awesomium10 commented 3 months ago

We are training a model and need a plot of the loss function against epochs while the model is training. How can we do this with torchmd?

RaulPPelaez commented 3 months ago

When you run torchmd-train (see the documentation here, or here for a more advanced approach), a file called metrics.csv is generated inside the logdir, along with checkpoints and other information.

The metrics.csv file will look similar to this:

epoch,lr,step,train_neg_dy_mse_loss,train_total_mse_loss,train_y_mse_loss,val_neg_dy_l1_loss,val_neg_dy_mse_loss,val_total_l1_loss,val_total_mse_loss,val_y_l1_loss,val_y_mse_loss
20.0,0.0005000000237487257,1357436,0.009475486353039742,5.106609344482422,0.012685230001807213,0.05106592923402786,0.010068569332361221,5.183582305908203,1.0168300867080688,0.07698939740657806,0.009973234497010708
21.0,0.0005000000237487257,1357593,0.012938769534230232,5.229669570922852,0.013254974037408829,0.04567231982946396,0.007519902195781469,4.6268630027771,0.7586674094200134,0.05963057279586792,0.006677159108221531
22.0,0.0005000000237487257,1357750,0.011407813988626003,5.110054969787598,0.01159985177218914,0.043068308383226395,0.00
....

This file contains, among other things, the information you asked for (the epoch and the different losses).

You can plot it from a Python script, for instance:

import pandas as pd
import matplotlib.pyplot as plt

# Load the metrics written by torchmd-train (adjust the path to your logdir)
df = pd.read_csv('/path/to/metrics.csv')

# Plot the total training MSE loss against the epoch
plt.figure(figsize=(10, 6))
plt.plot(df['epoch'], df['train_total_mse_loss'], marker='o')
plt.title('Epoch vs Train Total MSE Loss')
plt.xlabel('Epoch')
plt.ylabel('Train Total MSE Loss')
plt.grid(True)
plt.show()

Note that we also provide integration with some popular frameworks for visualizing ML training: https://torchmd-net.readthedocs.io/en/latest/torchmd-train.html#cmdoption-torchmd-train-wandb-use
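To compare training and validation curves in one figure, the script above can be extended along these lines. This is only a sketch: the helper names `per_epoch` and `plot_losses` are made up for illustration, and averaging rows that share an epoch is an assumption about how the file may be laid out (it is harmless if there is already one complete row per epoch). The log-scaled y axis and the optional rolling mean are just conveniences for reading noisy curves.

```python
import pandas as pd
import matplotlib.pyplot as plt


def per_epoch(df, columns):
    """Return one row per epoch for the requested loss columns.

    Rows sharing an epoch are averaged (NaNs are skipped), so duplicate or
    partially empty rows collapse into one. This layout is an assumption.
    """
    return df.groupby("epoch")[list(columns)].mean().sort_index()


def plot_losses(df, columns, window=1):
    """Plot the given loss columns against epoch on a log-scaled y axis.

    window > 1 applies a rolling mean over that many epochs, which can make
    the trend easier to see when the raw curve is noisy.
    """
    curves = per_epoch(df, columns)
    if window > 1:
        curves = curves.rolling(window, min_periods=1).mean()

    fig, ax = plt.subplots(figsize=(10, 6))
    for name in curves.columns:
        ax.plot(curves.index, curves[name], marker="o", label=name)
    ax.set_yscale("log")
    ax.set_xlabel("Epoch")
    ax.set_ylabel("Loss")
    ax.legend()
    ax.grid(True)
    return fig, ax
```

For example, `plot_losses(pd.read_csv('/path/to/metrics.csv'), ['train_total_mse_loss', 'val_total_mse_loss'], window=5)` followed by `plt.show()` draws both curves smoothed over five epochs. The column names are the ones from the sample file above; yours may differ depending on the loss you configured.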

Awesomium10 commented 3 months ago

Hi! We trained on alanine dipeptide data using torchmd with a train.yaml configuration file. Our output file (metrics.csv) contains several different loss functions, but we do not know which one is being minimised. So we picked one arbitrarily (train_total_mse_loss) from those in the file and plotted it against the number of epochs. The plot was not decreasing, but highly irregular. How do we find out which loss function the training actually uses, and how can we switch between them, if that is possible?

RaulPPelaez commented 3 months ago

The metrics.csv file contains the losses for the energy (y), the forces (neg_dy) and the total (sum of both) for the training, validation and test sets. So, for instance, if you chose MSE as the loss function, the loss on the energy for the training set is denoted train_y_mse_loss. It's hard to pinpoint why your loss is not going down just from the information you provided. Could you share your configuration?
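As a quick way to see which losses a given metrics.csv holds, the naming scheme described above can be decoded with a few lines of Python. This is a sketch, not part of torchmd-net: the function `describe_loss_columns` and the regular expression are hypothetical and only assume that loss columns follow the `<set>_<target>_<kind>_loss` pattern seen in the sample file.

```python
import re

# Assumed pattern: <set>_<target>_<kind>_loss, e.g. train_neg_dy_mse_loss
LOSS_COLUMN = re.compile(r"^(train|val|test)_(y|neg_dy|total)_(.+)_loss$")

# Meaning of each target, as described in the comment above
TARGETS = {"y": "energy", "neg_dy": "forces", "total": "total"}


def describe_loss_columns(columns):
    """Map each loss column name to (data set, target, loss kind).

    Columns that do not match the assumed pattern (epoch, lr, step, ...)
    are left out of the result.
    """
    described = {}
    for name in columns:
        match = LOSS_COLUMN.match(name)
        if match:
            split, target, kind = match.groups()
            described[name] = (split, TARGETS[target], kind)
    return described
```

Calling it on the column names of a loaded file, for example `describe_loss_columns(pd.read_csv('/path/to/metrics.csv').columns)` with pandas imported as `pd`, returns a dictionary such as `{'train_y_mse_loss': ('train', 'energy', 'mse'), ...}`.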

Awesomium10 commented 3 months ago

This is the train.yaml file we are using. train.yaml.txt

RaulPPelaez commented 3 months ago

Your network seems to be very barebones (you are disabling neighbor embedding, for instance), and you are also using the defaults for parameters such as the cutoff. I am inclined to believe this is a matter of hyperparameters. You seem to be trying to adapt this configuration file https://github.com/torchmd/torchmd-cg/blob/master/tutorial/train.yaml to a more recent version of the project. That repo is from long before my time, I am afraid, and I am not familiar enough with the Graph Network and its older iterations to tell you how each default or parameter translates. Perhaps the current documentation of this network will help: https://torchmd-net.readthedocs.io/en/latest/models.html#graph-network