When training on 8 GPUs, the throughput printed by DeepSpeed is much lower than the throughput calculated by the training code:
deepspeed SamplesPerSec=505
sample_per_sec: 50120
It seems that the throughput calculated by the training code equals the throughput printed by DeepSpeed multiplied by gradient_steps.
Which number is accurate? @lucidrains @janEbert
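For anyone who wants to check the same hypothesis on their own run, here is a minimal sketch that divides the two reported numbers and compares the ratio with the configured accumulation steps. The function names and the `GRADIENT_STEPS` value are hypothetical placeholders, not taken from DeepSpeed or the training code; substitute the value from your own config.

```python
def implied_ratio(code_samples_per_sec: float, deepspeed_samples_per_sec: float) -> float:
    """Return how many times larger the training-code figure is than the DeepSpeed figure."""
    return code_samples_per_sec / deepspeed_samples_per_sec


def matches_gradient_steps(ratio: float, gradient_steps: int, rel_tol: float = 0.05) -> bool:
    """Check whether the ratio is within rel_tol (relative) of gradient_steps."""
    return abs(ratio - gradient_steps) / gradient_steps <= rel_tol


# Numbers from the logs above.
DEEPSPEED_SAMPLES_PER_SEC = 505
CODE_SAMPLES_PER_SEC = 50120

# Hypothetical placeholder: replace with the accumulation steps from your config.
GRADIENT_STEPS = 100

ratio = implied_ratio(CODE_SAMPLES_PER_SEC, DEEPSPEED_SAMPLES_PER_SEC)
print(f"ratio = {ratio:.2f}")
print("consistent with gradient_steps:", matches_gradient_steps(ratio, GRADIENT_STEPS))
```

If the ratio lands close to the accumulation steps, the two counters are probably measuring different units (for example, per optimizer step versus per micro-batch), but that is an assumption this check alone cannot confirm.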