Closed: marcociccone closed this issue 6 years ago.
Hi, can you please clarify why you average the gradients every n iterations? Is it a way to increase the minibatch size when the batch does not fit in memory? Thanks!

Hi. Exactly! The gradients are also smoother when averaged across many training examples.

Ok, thanks! This is a smart idea, but I think it doesn't work in combination with batchnorm or other normalization methods, where you want to compute the statistics over the whole batch.

Yes, that's true. We did not include batchnorm in our code, though. I guess it is a good compromise when GPU memory gets full.
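For reference, here is a minimal sketch of the gradient-accumulation trick being discussed, written as a generic PyTorch loop. This is not this repo's actual code; `model`, `loader`, and `accumulation_steps` are placeholder names, and the toy data is only there to make the snippet self-contained.

```python
import torch
import torch.nn as nn

# Toy setup; in practice these come from your own project.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Fake data loader: 8 mini-batches of 16 examples each.
loader = [(torch.randn(16, 10), torch.randint(0, 2, (16,)))
          for _ in range(8)]

# Average gradients over n iterations before each update, so the
# effective batch size here is 4 * 16 = 64 examples.
accumulation_steps = 4

optimizer.zero_grad()
for i, (inputs, targets) in enumerate(loader):
    outputs = model(inputs)
    # Scale the loss so the summed gradients equal the average
    # over the whole accumulation window.
    loss = criterion(outputs, targets) / accumulation_steps
    loss.backward()  # gradients accumulate (sum) into .grad buffers

    if (i + 1) % accumulation_steps == 0:
        optimizer.step()       # one update with the averaged gradient
        optimizer.zero_grad()  # clear buffers for the next window
```

As noted above, this only reproduces a large batch for the gradient averaging: a BatchNorm layer would still compute its statistics over each small mini-batch, not over the full accumulated batch.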