mosaicml/composer

Supercharge Your Model Training
http://docs.mosaicml.com
Apache License 2.0

Unexpected logging behavior for final validation epoch #626

Closed · growlix closed this issue 2 years ago

growlix commented 2 years ago

**Update:** It's been determined that the final epoch validation accuracy log entry doesn't have `epoch` or `trainer/global_step` keys. Accordingly, plotting accuracy/val against trainer/global_step or epoch on the x-axis excludes the final epoch validation accuracy; plotting against _step is the only way to include the final epoch validation accuracy. Currently trying to diagnose the problem.

**Environment:** pytorch_internal + dev

**Example W&B runs**

**To reproduce**

Train a model to convergence. Plot accuracy/val vs. epoch and accuracy/val vs. step, and compare the final accuracy.

**Expected behavior**

The final value of accuracy/val should be identical regardless of which unit of time is on the x-axis.

**Additional context**

I'm not totally sure why this is happening. I examined the WandB logs for the runs linked above. When filtering for entries that contain the "accuracy/val" key, every entry except the final one also has an "epoch" key. The final two "accuracy/val"-containing entries for one of the runs are shown below:

```python
{'accuracy/val': 0.771399974822998,
 'trainer/global_step': 55625,
 'crossentropyloss/val': 0.9089045524597168,
 '_step': 55625,
 '_runtime': 11899,
 'wall_clock_train': 11398.399091243744,
 'loss/train': 0.8479165434837341,
 'lr-DecoupledSGDW/group0': 0.0007514306039705737,
 'trainer/batch_idx': 0,
 'epoch': 89,
 'throughput/epoch': 9994.72538837262,
 'throughput/step': 10257.778616345291,
 '_timestamp': 1644973588}
```

```python
{'accuracy/val': 0.7707800269126892,
 'crossentropyloss/val': 0.910305380821228,
 '_step': 56250,
 '_runtime': 12030,
 'wall_clock_train': 11526.369808673859,
 'lr-DecoupledSGDW/group0': 0,
 'throughput/epoch': 10002.288224249523,
 'throughput/step': 10258.344327400502,
 '_timestamp': 1644973719}
```

The second-to-last entry has an "epoch" key, while the final one does not.
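For anyone who wants to reproduce this check, here's a minimal sketch using the W&B public API; the run path is a placeholder, and I'm assuming a wandb version that provides `Api.run()` and `Run.scan_history()`:

```python
import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")  # placeholder: substitute a real run path

# scan_history() iterates over every logged row, unsampled.
val_rows = [row for row in run.scan_history() if "accuracy/val" in row]

# For the last two validation rows, report which time-axis keys are absent.
for row in val_rows[-2:]:
    missing = [k for k in ("epoch", "trainer/global_step") if k not in row]
    print(f"_step={row.get('_step')}  missing: {missing or 'none'}")
```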

@abhi-mosaic, @hanlint, @jbloxham thoughts? (just guessing who might have experience w/ logging)

abhi-mosaic commented 2 years ago

Not sure if this is the whole problem, but on WandB, if I switch the x-axis to trainer/global_step (which is what I think we explicitly log), the problem disappears. I wonder what _step is and how exactly it's tracked?

Also, I wonder how often accuracy/val shows up in the raw data. It should hopefully show up only 90 times (once per epoch); if it's more than that, then something funky is going on.
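One quick way to answer that with the W&B public API (a sketch with a placeholder run path, not tied to any specific run above):

```python
import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")  # placeholder: substitute a real run path

# Count rows containing accuracy/val; one validation pass per epoch
# over 90 epochs should yield exactly 90.
count = sum(1 for row in run.scan_history() if "accuracy/val" in row)
print(f"accuracy/val appears in {count} rows")
```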

growlix commented 2 years ago

Good call on examining _step vs. trainer/global_step: the final (90th) epoch not only doesn't have an epoch key, it doesn't have a trainer/global_step key either. So it seems like plotting with trainer/global_step or epoch on the x-axis excludes the final epoch validation accuracy; plotting with _step is the only way to include the final epoch validation accuracy.
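To make that concrete, here's a sketch along the same lines as above (placeholder run path again) that prints which x-axis keys the final accuracy/val row actually carries:

```python
import wandb

api = wandb.Api()
run = api.run("entity/project/run_id")  # placeholder: substitute a real run path

rows = [row for row in run.scan_history() if "accuracy/val" in row]
final = rows[-1]

# _step is W&B's internal, always-present row counter; epoch and
# trainer/global_step are the keys Composer logs explicitly.
for key in ("_step", "trainer/global_step", "epoch"):
    print(key, "->", final.get(key, "<missing>"))
```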

I'm updating the title and summary accordingly!