google-research / tuning_playbook

A playbook for systematically maximizing the performance of deep learning models.

All benefits of using a larger batch size assume the training throughput increases? #32

Closed SimLif closed 1 year ago

SimLif commented 1 year ago
  • All benefits of using a larger batch size assume the training throughput increases. If it doesn't, fix the bottleneck or use the smaller batch size.
  • Gradient accumulation simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. It should generally be avoided in applied work.
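To illustrate the gradient-accumulation point quoted above, here is a minimal sketch (assuming PyTorch; the toy model, data, and `accumulation_steps` value are purely illustrative and not from the playbook) of how accumulating gradients over several micro-batches simulates a larger effective batch without increasing per-step memory, and also without improving throughput:

```python
# Minimal gradient-accumulation sketch: simulates an effective batch size of
# micro_batch * accumulation_steps, but still runs accumulation_steps forward/
# backward passes per update, so there is no throughput benefit.
import torch
from torch import nn

model = nn.Linear(10, 1)                      # toy model, purely illustrative
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

micro_batch, accumulation_steps = 8, 4        # effective batch size = 32
data = [(torch.randn(micro_batch, 10), torch.randn(micro_batch, 1))
        for _ in range(accumulation_steps * 3)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data):
    # Scale the loss so the summed gradients equal the mean over the full
    # effective batch rather than accumulation_steps times that mean.
    loss = criterion(model(inputs), targets) / accumulation_steps
    loss.backward()                           # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                      # one update per effective batch
        optimizer.zero_grad()
```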

Does increasing the batch size guarantee more stable gradient descent?
In which scenarios should gradient accumulation be used?
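
For context on the first question: the standard argument (not a quote from the playbook) is that the mini-batch gradient is an average of per-example gradients, so for i.i.d. samples its variance shrinks with batch size $B$:

$$\operatorname{Var}\big[\hat{g}_B\big] = \operatorname{Var}\Big[\tfrac{1}{B}\sum_{i=1}^{B}\nabla_\theta \ell(x_i;\theta)\Big] = \tfrac{1}{B}\operatorname{Var}\big[\nabla_\theta \ell(x_i;\theta)\big]$$

So the gradient estimate is lower variance, but that by itself does not guarantee better final performance or faster wall-clock training.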