lessw2020 / Ranger21

Ranger deep learning optimizer rewrite to use newest components
Apache License 2.0

About gradient normalization #29

Open julightzhong10 opened 3 years ago

julightzhong10 commented 3 years ago

Hi,

Thanks for the great work. I think gradient normalization is a reasonable way to extend GC (gradient centralization). However, two points confuse me:

  1. Only gradients whose size is greater than 1 are centralized by the mean, whereas all gradients (with no size filter) are normalized by the std. Is this an empirically better implementation, or is it a bug?

  2. The mean and the std in gradient normalization are also computed over different dimensions, which is not intuitive to me.
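
To make point 2 concrete, here is a small pure-Python sketch of the mismatch as I understand it. The function names and the 2x2 example gradient are my own illustration, not Ranger21's code, and I use the population std only to keep the numbers simple.

```python
from statistics import mean, pstdev

EPS = 1e-8  # assumed small constant to avoid division by zero


def centralize_rows(grad):
    """Subtract each row's own mean (a per-row, GC-style centering)."""
    centered = []
    for row in grad:
        m = mean(row)
        centered.append([g - m for g in row])
    return centered


def normalize_global(grad):
    """Divide every element by a single std taken over the whole tensor."""
    flat = [g for row in grad for g in row]
    s = pstdev(flat) + EPS
    return [[g / s for g in row] for row in grad]


def normalize_rows(grad):
    """Divide each row by its own std, i.e. the same axis as the mean."""
    return [[g / (pstdev(row) + EPS) for g in row] for row in grad]


# Hypothetical 2x2 gradient whose rows differ in scale by 10x.
grad = [[1.0, 3.0], [10.0, 30.0]]

centered = centralize_rows(grad)      # per-row mean removed
mixed = normalize_global(centered)    # mean per row, std over everything
matched = normalize_rows(centered)    # mean and std both per row
```

With the mixed variant the second row stays about 10x larger than the first after normalization, while the matched variant brings both rows to the same scale. That difference is what I would like to understand.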

Thanks in advance for your reply.