Closed ShriyaPalsamudram closed 3 months ago
Upgrade the Dockerfile to use a newer PyTorch container so that the reference runs on NVIDIA Hopper GPUs.
Verified that the new changes do not impact convergence:
GBS = 2048: original megatron-lm RCP samples from v4.0 RCPs = [1207959552, 1207959552, 1157627904]; one-seed convergence samples = [1157627904]
GBS = 3072: original megatron-lm RCP samples from v4.0 RCPs = [1207959552, 1207959552, 1207959552]; one-seed convergence samples = [13790871552]
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅