mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0
1.57k stars, 548 forks

Update megatron-lm reference to run on hopper gpus #748

Open ShriyaPalsamudram opened 2 weeks ago

ShriyaPalsamudram commented 2 weeks ago

Upgrade the Dockerfile to use a newer PyTorch container so that the reference runs on NVIDIA Hopper GPUs.

Verified that the new changes do not impact convergence:

GBS = 2048
Original megatron-lm RCP samples from v4.0 RCPs = [1207959552, 1207959552, 1157627904]
One seed convergence samples = [1157627904]

GBS = 3072
Original megatron-lm RCP samples from v4.0 RCPs = [1207959552, 1207959552, 1207959552]
One seed convergence samples = [13790871552]
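For readers who want to repeat this kind of comparison, here is a minimal Python sketch that sets a new run's samples-to-converge against a list of reference samples. It is not the official MLPerf RCP checker. The function name `compare_to_reference` and the criterion used (ratio to the reference mean, plus whether the run converged in no more samples than the slowest reference run) are assumptions made for illustration. The usage lines apply it to the GBS = 2048 figures quoted above.

```python
from statistics import mean


def compare_to_reference(reference_samples, new_samples):
    """Compare new samples-to-converge values against reference runs.

    Hypothetical helper, not part of the MLPerf tooling. For each new run,
    returns a tuple (ratio_to_reference_mean, no_slower_than_slowest_reference).
    """
    ref_mean = mean(reference_samples)
    ref_max = max(reference_samples)
    results = []
    for samples in new_samples:
        ratio = samples / ref_mean          # < 1.0 means fewer samples than the reference mean
        within = samples <= ref_max         # True if no slower than the slowest reference run
        results.append((ratio, within))
    return results


# GBS = 2048 figures quoted in this PR description
rcp_2048 = [1207959552, 1207959552, 1157627904]
new_2048 = [1157627904]

results_2048 = compare_to_reference(rcp_2048, new_2048)
for ratio, within in results_2048:
    print(f"ratio to reference mean: {ratio:.3f}, "
          f"no slower than slowest reference: {within}")
```

The same call works for any batch size by passing that batch size's reference list and new-run list. A single seed gives only a rough indication, so more seeds would be needed for a statistically meaningful comparison.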

github-actions[bot] commented 2 weeks ago

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅