Thanks for the great work. I think gradient normalization is a reasonable idea for extending GC, but I noticed two points that confuse me:
Gradients whose size is greater than 1 are centralized by subtracting the mean, while all gradients (without the size filter) are normalized by the std. Is this an empirically better implementation, or is it a bug?
I also notice that the mean and the std in the gradient-normalization step are computed over different dimensions, which is not very intuitive to me.
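To make the question concrete, here is a minimal PyTorch sketch of the behavior I am describing; the function name, the `eps` value, and the exact dimension choices are my own assumptions for illustration, not the repo's actual code:

```python
import torch

def centralize_and_normalize(grad: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of the described behavior (illustrative, not the repo's code)."""
    if len(grad.size()) > 1:
        # Centralization is gated on the gradient size: only multi-dimensional
        # gradients (e.g. conv/linear weights) have the mean subtracted,
        # computed over every dimension except dim 0.
        grad = grad - grad.mean(dim=tuple(range(1, grad.dim())), keepdim=True)
    # Normalization is applied to every gradient, and the std here is taken
    # over the whole tensor, i.e. over different dimensions than the mean.
    return grad / (grad.std() + eps)

w_grad = torch.randn(64, 3, 3, 3)  # conv-weight-like gradient: centralized and normalized
b_grad = torch.randn(64)           # bias-like gradient: only normalized
print(centralize_and_normalize(w_grad).shape, centralize_and_normalize(b_grad).shape)
```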
Thanks for the reply.