As noted in MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 8:
we develop a performance analysis tool that records the execution time of critical code segments on each machine rank during a run. In contrast to previous tools such as the torch profiler or the Megatron-LM timer, our tool times events based on the CUDA events method. This approach minimizes the need for CUDA synchronization, thus preventing performance degradation, allowing us to consistently run it in our production training jobs.
Use torch.cuda.Event for measuring elapsed time; it minimizes CUDA synchronization compared to time.time(). [link]
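A minimal sketch of this timing pattern (the helper name `time_with_cuda_events` and the matmul workload are illustrative, not from the paper or MegaScale's actual tool):

```python
import torch

def time_with_cuda_events(fn, *args):
    """Time a GPU operation with CUDA events.

    A time.time() measurement would need torch.cuda.synchronize()
    before and after the measured region, because kernels launch
    asynchronously. CUDA events are instead recorded into the stream,
    and synchronization happens only when the elapsed time is read.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()   # enqueued on the current stream, non-blocking
    out = fn(*args)
    end.record()

    # Synchronize once, only when the measurement is consumed.
    end.synchronize()
    elapsed_ms = start.elapsed_time(end)  # milliseconds
    return out, elapsed_ms

if __name__ == "__main__":
    x = torch.randn(4096, 4096, device="cuda")
    _, ms = time_with_cuda_events(torch.matmul, x, x)
    print(f"matmul took {ms:.3f} ms")
```

In a production job, the `end.synchronize()` call can be deferred and the elapsed times read in a batch (e.g. at logging intervals), so the measurement itself adds no synchronization points to the training step.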