RaulPPelaez opened 1 year ago
@AntonioMirarchi please review
It works fine. This is the time column from the metrics.csv:
```
0     4.779309
1     6.532793
2     8.925880
3    10.662247
4    13.062045
5    14.766587
6    17.175591
7    18.913141
8    21.361235
9    23.063034
10         NaN
Name: time, dtype: float64
```
The NaN in the last row is not clear to me.
I think the NaN is due to the trainer reaching the maximum number of epochs: it does not start a new epoch, but it creates a row in the metrics file regardless.
I am only logging the time in `on_validation_epoch_end` and `on_test_epoch_end`. Perhaps there is some other place where data is logged?
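For reference, the pattern being discussed can be sketched without Lightning itself. This is a schematic stand-in (the class, method names mirror the Lightning hooks mentioned above, but `EpochTimer` and its `rows` list are hypothetical, not the actual TorchMD-Net code): record wall-clock time since construction at the end of each validation epoch, as a float, the way `self.log("time", ...)` would.

```python
import time

class EpochTimer:
    """Schematic stand-in for the Lightning hook pattern discussed above:
    log elapsed wall-clock seconds at the end of each validation epoch."""

    def __init__(self):
        self.start = time.monotonic()
        self.rows = []  # one dict per logged epoch, like metrics.csv rows

    def on_validation_epoch_end(self, epoch):
        # Elapsed seconds as a float (Lightning's self.log expects a
        # floating-point value so it can be reduced across processes).
        self.rows.append({"epoch": epoch, "time": time.monotonic() - self.start})

timer = EpochTimer()
for epoch in range(3):
    time.sleep(0.01)  # stand-in for one epoch of training
    timer.on_validation_epoch_end(epoch)

times = [row["time"] for row in timer.rows]
assert times == sorted(times)  # elapsed time grows monotonically
```

In a real `LightningModule`, the body of `on_validation_epoch_end` would call `self.log("time", elapsed)` instead of appending to a list.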
Antonio can you give me a yaml to reproduce this NaN thing you see? I cannot trigger it.
I think the NaN Antonio reports comes from the test pass at the end of training. It is treated as a new "training", and for some reason the first time value is NaN.
I tried changing the epoch logging to an integer, but this produces a warning:
```
You called `self.log('epoch', ...)` in your `on_validation_epoch_end` but the value needs to be floating to be reduced. Converting it to torch.float32. You can silence this warning by converting the value to floating point yourself. If you don't intend to reduce the value (for instance when logging the global step or epoch) then you can use `self.logger.log_metrics({'epoch': ...})` instead.
```
There is some (really unconvincing, imo) discussion on why one cannot log an integer: https://github.com/Lightning-AI/pytorch-lightning/issues/18739. The alternative suggested in the warning itself does not work as one would expect: you get a new line in metrics.csv in which everything is empty except `epoch`. cc @peastman
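That mostly-empty extra row also explains the NaN seen earlier: when a row in metrics.csv has a value only for `epoch`, pandas reads the empty cells back as NaN. A small sketch with a hypothetical metrics.csv fragment (the file contents below are fabricated to mimic the output shown above, not taken from an actual run):

```python
import io
import pandas as pd

# Hypothetical metrics.csv tail: the last row was written with only
# 'epoch' filled in (e.g. via self.logger.log_metrics({'epoch': ...})),
# so the 'time' cell on that line is empty.
csv = io.StringIO(
    "epoch,time\n"
    "8,21.361235\n"
    "9,23.063034\n"
    "10,\n"
)
df = pd.read_csv(csv)

# The empty cell comes back as NaN, matching the column dump above.
assert df["time"].isna().iloc[-1]
assert not df["time"].isna().iloc[:-1].any()
```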
This PR changes the LNNP module so that the wall-clock time since the start of training is logged each epoch, which allows tracking training time with the CSVLogger.