pytorch / torchtitan

A native PyTorch Library for large model training
BSD 3-Clause "New" or "Revised" License

Fix the incorrect step log for profiler after resuming from a checkpoint #293

Closed: fegin closed this 2 months ago

fegin commented 2 months ago

Summary: The profiler currently maintains its own local step counter, which is not synchronized with the checkpointed train step. As a result, the step it logs is incorrect after resuming from a checkpoint. This PR fixes the issue.
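
To illustrate the mismatch, here is a minimal sketch that does not use the real profiler. `StepLogger`, `run`, and the `sync` flag are hypothetical names made up for this example; the only assumption is that the profiler labels its output with a counter it increments itself.

```python
from __future__ import annotations


class StepLogger:
    """Hypothetical stand-in for a profiler that labels output by step."""

    def __init__(self, start_step: int = 0) -> None:
        # Seeding the counter from the restored train step keeps the
        # profiler's labels aligned with the trainer after a resume.
        self.step_num = start_step

    def step(self) -> int:
        # Advance the local counter and return the step it would log.
        self.step_num += 1
        return self.step_num


def run(resume_step: int, num_steps: int, sync: bool) -> list[tuple[int, int]]:
    """Return (train_step, profiler_step) pairs for a resumed run."""
    profiler = StepLogger(start_step=resume_step if sync else 0)
    train_step = resume_step
    pairs = []
    for _ in range(num_steps):
        train_step += 1
        pairs.append((train_step, profiler.step()))
    return pairs


# Without syncing, the profiler restarts from 0 after the resume.
print(run(resume_step=100, num_steps=2, sync=False))
# With syncing, both counters agree.
print(run(resume_step=100, num_steps=2, sync=True))
```

With `sync=False` the two numbers in each pair diverge by the resumed step; with `sync=True` they match.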

fegin commented 2 months ago

In some trainers, profiling is designed to run only once, and the global step is used to prevent it from running again after resuming from a checkpoint.
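
A rough sketch of that "profile once" pattern, under the assumption that the trainer profiles at a single fixed step. The names `profiled_with_global_step`, `profiled_with_local_counter`, and `profile_step` are hypothetical and only illustrate why the guard has to use the global step rather than a counter that restarts on resume.

```python
from __future__ import annotations


def profiled_with_global_step(
    resume_step: int, total_steps: int, profile_step: int
) -> list[int]:
    """Global steps at which profiling fires when guarded by the global step."""
    return [
        step
        for step in range(resume_step + 1, total_steps + 1)
        if step == profile_step
    ]


def profiled_with_local_counter(
    resume_step: int, total_steps: int, profile_step: int
) -> list[int]:
    """Global steps at which profiling fires when guarded by a local counter."""
    fired = []
    local = 0  # restarts from zero on every launch, including resumes
    for step in range(resume_step + 1, total_steps + 1):
        local += 1
        if local == profile_step:
            fired.append(step)
    return fired


# Fresh run: both guards profile at step 5.
print(profiled_with_global_step(0, 20, 5), profiled_with_local_counter(0, 20, 5))
# Resume from step 7: the global-step guard stays quiet, while the
# local counter triggers a second profile at global step 12.
print(profiled_with_global_step(7, 20, 5), profiled_with_local_counter(7, 20, 5))
```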