Thanks for the great work. I think gradient normalization is a reasonable idea for extending GC, but I noticed two points that confuse me:
Gradients whose size is greater than 1 are centralized by subtracting the mean, while all gradients (without the size filter) are normalized by the std. Is this an empirically better implementation, or is it a bug?
I also notice that the mean and the std in the gradient-normalization step are computed over different dimensions, which is not very intuitive to me.
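To make the question concrete, here is a minimal PyTorch sketch of the behavior I am describing; the function name, the `eps` value, and the exact dimension choices are my own assumptions for illustration, not the repo's actual code:

```python
import torch

def centralize_and_normalize(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of the described behavior (illustrative, not the repo's code)."""
    if len(grad.size()) > 1:
        # Centralization is gated on the gradient size: only multi-dimensional
        # gradients (e.g. conv/linear weights) have the mean subtracted,
        # computed over every dimension except dim 0.
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    # Normalization is applied to every gradient, and the std here is taken
    # over the whole tensor, i.e. over different dimensions than the mean.
    return grad / (grad.std() + eps)

w_grad = torch.randn(64, 3, 3, 3)  # conv-weight-like gradient: centralized and normalized
b_grad = torch.randn(64)           # bias-like gradient: only normalized
print(centralize_and_normalize(w_grad).shape, centralize_and_normalize(b_grad).shape)
```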
Thanks for the reply.