Open Synray opened 1 year ago
Instead of averaging every parameter's gradient at the end, average the output gradient once at the start, which reduces the number of divisions. This is equivalent because backpropagation is linear in the output gradient, so the 1/n factor propagates backwards to every parameter gradient.
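A minimal sketch of the idea, assuming a single linear layer with a squared-error loss and NumPy arrays. The function names, shapes, and loss are hypothetical and chosen only for illustration; they are not taken from this project's code.

```python
import numpy as np

def grads_average_at_end(W, b, X, Y):
    """Backpropagate the summed loss, then divide every parameter gradient by n."""
    n = X.shape[0]
    out = X @ W + b              # forward pass, shape (n, k)
    d_out = out - Y              # d/d_out of 0.5 * sum((out - Y)**2)
    dW = X.T @ d_out             # gradient summed over the batch
    db = d_out.sum(axis=0)
    return dW / n, db / n        # one division per parameter entry

def grads_average_at_start(W, b, X, Y):
    """Scale the output gradient by 1/n once, then backpropagate as usual."""
    n = X.shape[0]
    out = X @ W + b
    d_out = (out - Y) / n        # one division per output-gradient entry
    dW = X.T @ d_out             # the 1/n factor carries through linearly
    db = d_out.sum(axis=0)
    return dW, db

# Hypothetical sizes: 4 samples, 64 inputs, 2 outputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64))
Y = rng.normal(size=(4, 2))
W = rng.normal(size=(64, 2))
b = rng.normal(size=2)

dW_end, db_end = grads_average_at_end(W, b, X, Y)
dW_start, db_start = grads_average_at_start(W, b, X, Y)

# Both orderings give the same averaged gradients (up to floating-point error).
assert np.allclose(dW_end, dW_start)
assert np.allclose(db_end, db_start)
```

With these sizes, averaging at the end divides 130 parameter entries, while averaging at the start divides only 8 output-gradient entries. The saving depends on the parameters outnumbering the entries of the output gradient, which is the usual case for a network but not guaranteed.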