As noted in MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 8:
we develop a performance analysis tool that records the execution time of critical code segments on each machine rank during a run. In contrast to previous tools such as the torch profiler or the Megatron-LM timer, our tool times events based on the CUDA events method. This approach minimizes the need for CUDA synchronization, thus preventing performance degradation, allowing us to consistently run it in our production training jobs.
Use torch.cuda.Event for measuring elapsed time; it minimizes CUDA synchronization compared to time.time(). [link]
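A minimal sketch of this timing pattern (the helper name `time_with_cuda_events` and the matmul workload are illustrative, not from the paper or MegaScale's actual tool):

```python
import torch

def time_with_cuda_events(fn, *args):
    """Time a GPU operation with CUDA events.

    A time.time() measurement would need torch.cuda.synchronize()
    before and after the measured region, because kernels launch
    asynchronously. CUDA events are instead recorded into the stream,
    and synchronization happens only when the elapsed time is read.
    """
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()   # enqueued on the current stream, non-blocking
    out = fn(*args)
    end.record()

    # Synchronize once, only when the measurement is consumed.
    end.synchronize()
    elapsed_ms = start.elapsed_time(end)  # milliseconds
    return out, elapsed_ms

if __name__ == "__main__":
    x = torch.randn(4096, 4096, device="cuda")
    _, ms = time_with_cuda_events(torch.matmul, x, x)
    print(f"matmul took {ms:.3f} ms")
```

In a production job, the `end.synchronize()` call can be deferred and the elapsed times read in a batch (e.g. at logging intervals), so the measurement itself adds no synchronization points to the training step.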