huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

[Feature] Use CUDA events for measuring elapsed time #88

Open xrsrke opened 8 months ago

xrsrke commented 8 months ago

As mentioned in MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs, page 8:

we develop a performance analysis tool that records the execution time of critical code segments on each machine rank during a run. In contrast to previous tools such as the torch profiler or the Megatron-LM timer, our tool times events based on the CUDA events method. This approach minimizes the need for CUDA synchronization, thus preventing performance degradation, allowing us to consistently run it in our production training jobs.

Use torch.cuda.Event for measuring elapsed time, to minimize CUDA synchronization compared to time.time(). [link]
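A minimal sketch of what such a timer could look like. The class name `EventTimer` and the CPU fallback via `time.perf_counter()` are illustrative assumptions, not part of nanotron; the CUDA path uses `torch.cuda.Event(enable_timing=True)` and synchronizes only when the result is read, rather than around every timed region:

```python
import time

import torch


class EventTimer:
    """Hypothetical context-manager timer (illustrative sketch).

    On a GPU it records torch.cuda.Event pairs, deferring synchronization
    to the single point where elapsed time is read; without a GPU it falls
    back to time.perf_counter().
    """

    def __init__(self):
        self.use_cuda = torch.cuda.is_available()
        self.elapsed_ms = None
        if self.use_cuda:
            self._start = torch.cuda.Event(enable_timing=True)
            self._end = torch.cuda.Event(enable_timing=True)

    def __enter__(self):
        if self.use_cuda:
            # Enqueue a start marker on the current CUDA stream; no sync here.
            self._start.record()
        else:
            self._t0 = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        if self.use_cuda:
            self._end.record()
            # Single synchronization point: wait for the end event only.
            self._end.synchronize()
            self.elapsed_ms = self._start.elapsed_time(self._end)
        else:
            self.elapsed_ms = (time.perf_counter() - self._t0) * 1000.0
        return False
```

Because the events are recorded on the stream asynchronously, the host is not forced to stall after every timed segment the way a `time.time()` bracket (which needs `torch.cuda.synchronize()` on both sides to be accurate) would require.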