Closed: growlix closed this issue 2 years ago
Not sure if this is the whole problem, but on WandB, if I switch the x-axis to `trainer/global_step`, which is what I think we explicitly log, then the problem disappears. I wonder, what is `_step` and how is it tracked exactly?

Also, I wonder how often `accuracy/val` shows up in the raw data? It should hopefully only show up 90 times; if it's more than that, something funky is going on.
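If it helps, one way to count that directly from the raw run history is the W&B public API (the `entity/project/run_id` path below is a placeholder for one of the linked runs):

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/my-run-id")  # placeholder run path

# scan_history() iterates over the full, unsampled history; each row is a dict
# of the keys that were logged at that _step.
val_rows = [row for row in run.scan_history() if "accuracy/val" in row]
print(f"accuracy/val appears in {len(val_rows)} rows")  # expect 90 for a 90-epoch run
```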
Good call on examining `_step` vs. `trainer/global_step`: the final (90th) epoch's log entry not only lacks an `epoch` key, it also lacks a `trainer/global_step` key. So it seems like plotting with `trainer/global_step` or `epoch` on the x-axis excludes the final epoch's validation accuracy; plotting with `_step` is the only way to include the final epoch's validation accuracy.

I'm updating the title and summary accordingly!
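For what it's worth, `_step` is W&B's internal row counter: every committed `wandb.log()` call creates one history row at an auto-incremented `_step`, and any key left out of that call is simply absent from that row. A minimal standalone sketch (plain `wandb.log`, independent of how our trainer's logger actually batches its calls) of how a row can end up with `accuracy/val` but no `epoch`:

```python
import wandb

run = wandb.init(project="missing-key-demo")  # hypothetical throwaway project

# Each committed wandb.log() call produces one history row at an auto-incremented
# internal _step. Keys omitted from a call are absent from that row.
wandb.log({"epoch": 0, "trainer/global_step": 100, "accuracy/val": 0.5})
wandb.log({"accuracy/val": 0.6})  # this row has _step but no epoch / trainer/global_step

run.finish()
```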
**Update** It's been determined that the final epoch's validation accuracy log entry doesn't have `epoch` or `trainer/global_step` keys. Accordingly, plotting `accuracy/val` against `trainer/global_step` or `epoch` on the x-axis excludes the final epoch's validation accuracy; plotting with `_step` is the only way to include it. Currently trying to diagnose the problem.

**Environment**
pytorch_internal + dev

**Example W&B runs**

**To reproduce**
Train a model to convergence. Plot `accuracy/val` vs. `epoch` and `accuracy/val` vs. `_step`, and compare the final accuracy.
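A rough way to compare the two final values programmatically instead of reading them off the charts, assuming the W&B public API and a placeholder run path:

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/my-run-id")  # placeholder run path

val_rows = [row for row in run.scan_history() if "accuracy/val" in row]

# Final value when plotting against _step (every history row has _step).
final_by_step = val_rows[-1]["accuracy/val"]

# Final value when plotting against epoch / trainer/global_step: rows missing
# those keys are dropped from the chart, so the last surviving row is earlier.
rows_with_epoch = [row for row in val_rows if "epoch" in row]
final_by_epoch = rows_with_epoch[-1]["accuracy/val"]

print(final_by_step, final_by_epoch)  # these should match, but currently differ
```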
**Expected behavior**
The final value of `accuracy/val` should be identical regardless of which unit of time is on the x-axis.
**Additional context**
I'm not totally sure why this is happening. I examined the WandB logs for the runs linked above. When filtering for entries that contain the `"accuracy/val"` key, every entry except the final one also has an `"epoch"` key. I show the final two `"accuracy/val"`-containing entries for one of the runs below: the second-to-last entry has an `"epoch"` key, while the final one does not.

@abhi-mosaic, @hanlint, @jbloxham thoughts? (just guessing who might have experience w/ logging)
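For anyone who wants to repeat this filtering on other runs, here is roughly how it can be scripted against the W&B public API (the run path below is a placeholder):

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/my-project/my-run-id")  # placeholder run path

val_rows = [row for row in run.scan_history() if "accuracy/val" in row]

# Compare the key sets of the last two rows that contain accuracy/val.
for row in val_rows[-2:]:
    print(sorted(row.keys()))
# Observed: the second-to-last row includes "epoch", the final row does not.
```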