torchmd / torchmd-net

Training neural network potentials
MIT License
335 stars 75 forks source link

Log epoch real time in LNNP #231

Open RaulPPelaez opened 1 year ago

RaulPPelaez commented 1 year ago

This PR changes the LNNP module so that the real time since the start is logged each epoch. Allows to track training time with the CSVLogger.

RaulPPelaez commented 1 year ago

@AntonioMirarchi please review

AntonioMirarchi commented 1 year ago

It works fine. This is the time column form the metrics.csv:

0      4.779309
1      6.532793
2      8.925880
3     10.662247
4     13.062045
5     14.766587
6     17.175591
7     18.913141
8     21.361235
9     23.063034
10          NaN
Name: time, dtype: float64

The Nan on the last row is not clear to me

AntonioMirarchi commented 1 year ago

I think the Nan is due to reaching the max num epochs by the trainer, so it's not starting a new epoch but it's creating a row in the metrics regardless of it.

RaulPPelaez commented 1 year ago

I am only logging the time in on_validation_epoch_end and on_test_epoch_end. Perhaps there is some other place where data is logged?

RaulPPelaez commented 1 year ago

Antonio can you give me a yaml to reproduce this NaN thing you see? I cannot trigger it.

RaulPPelaez commented 10 months ago

I think the NaN Antonio reports comes from the test pass at the end of a train. It is considered a new "training" and for some reason the first time has NaN.

RaulPPelaez commented 10 months ago

Tried to change the logging of the epoch to integer, but this produces a warning:

 You called `self.log('epoch', ...)` in your `on_validation_epoch_end` but the value needs to be floating to be reduced. Converting it to torch.float32. You can silence this warning by converting the value to floating point yourself. If you don't intend to reduce the value (for instance when logging the global step or epoch) then you can use `self.logger.log_metrics({'epoch': ...})` instead.

There is some (really unconvincing imo) discussion on why one cannot log an integer https://github.com/Lightning-AI/pytorch-lightning/issues/18739 The alternative suggested in the warning itself does not work as one would expect, you get a new line in metrics.csv in which everything is empty except epoch. cc @peastman