Open alexander-soare opened 2 months ago
Thanks, Alexander, for sharing this issue with me. I'll start working on it.
Following are the logged items. Here is my understanding of their use case:
Proposed new logging items:
@MayankChaturvedi thanks for getting a start on this:
"ep" is number of episodes, and "epch" is number of epochs. I'm not sure about "eo". "updt_s" also includes the forward pass.
For 8 and 9, I think a more useful thing would be the average since last log. It's not pretty that the variance is dependent on the log frequency, but I think that's not too bad from a usability perspective. What do you think?
Part of me thinks we should even remove "updt_s" and "data_s", as they are relatively useless compared to the aggregate metrics.
cc @Cadene in case you want to chime in
Oops, my bad: "ep", not "eo". "p" and "o" are next to each other on the keyboard 🙃 Thanks, @alexander-soare, for your insights. You are right, the average since the last log makes more sense than the overall average. Would one last log at the end of training, reporting the overall average update time and loading time, help users? (This could be separate from this issue.)
I second the thought that updt_s and data_s should be removed
@MayankChaturvedi after discussing with @Cadene here are some ideas:
Currently we log metrics for a single training step, every `training.eval_freq` steps. The problem with this is that there may be large variance in the metrics, meaning we often don't get a representative value. The worst case of this is the timing metric `data_s`, which is 0 a lot of the time, but sometimes non-zero and large because the dataloader is working on fetching the next set of batches. This means we don't get a good read on the data loading bottleneck. It would be better to have an aggregated metric which is the average `data_s` per step over the last `training.eval_freq` steps. Of course, it's still useful to have the non-aggregated metric, so we need to think about how to do this without losing useful information.
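For illustration, the "average since last log" idea could be sketched roughly as below. This is only a hypothetical sketch, not LeRobot's actual implementation: the class name `MetricAverager` and its methods are invented here, and the metric names `data_s`/`updt_s` are taken from the discussion above.

```python
from collections import defaultdict


class MetricAverager:
    """Hypothetical helper: accumulate per-step metric values and report
    the mean since the last log, instead of a single (high-variance)
    per-step value."""

    def __init__(self):
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def update(self, **metrics):
        # Called once per training step with e.g. data_s=..., updt_s=...
        for name, value in metrics.items():
            self._sums[name] += value
            self._counts[name] += 1

    def log_and_reset(self):
        # Called every `training.eval_freq` steps: return the averages
        # accumulated since the last log, then reset the accumulators.
        averages = {name: self._sums[name] / self._counts[name] for name in self._sums}
        self._sums.clear()
        self._counts.clear()
        return averages


tracker = MetricAverager()
tracker.update(data_s=0.0, updt_s=0.12)  # dataloader had a batch ready
tracker.update(data_s=0.8, updt_s=0.10)  # dataloader blocked on fetching
print(tracker.log_and_reset())  # averages since last log, e.g. data_s ≈ 0.4
```

One could keep the non-aggregated signal too, e.g. by also logging the max per window, but that is a separate design question.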