pzdkn opened this issue 3 years ago
I see, I guess it doesn't matter much in practice, as it is just a constant scaling. Still, without the scaling the estimator is biased: it estimates $\sigma$ times the gradient rather than the gradient itself.
🤝 I also noticed this issue today. Without dividing by $\sigma$, the gradient estimate is around 100x smaller in magnitude (since $\sigma$ is usually on the order of 0.01, the missing $1/\sigma$ factor is about 100). That would explain why the "stepsize: 0.01" in the example configuration is relatively large. I am curious about the rationale as well.
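To spell the scale argument out (using the $\sigma \approx 0.01$ and stepsize $= 0.01$ numbers quoted above, which may not match every config exactly):

$$\alpha \cdot \frac{1}{n}\sum_{i=1}^{n} F_i \epsilon_i \;=\; (\alpha\,\sigma) \cdot \frac{1}{n\sigma}\sum_{i=1}^{n} F_i \epsilon_i,$$

so a stepsize of $\alpha = 0.01$ applied to the un-rescaled sum behaves like a stepsize of $0.01 \times 0.01 = 10^{-4}$ on the rescaled gradient from the paper; the relatively large stepsize compensates for the missing factor of $1/\sigma = 100$.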
@zxymark221, I guess in the end it just gets absorbed into the learning rate, since $\sigma$ is constant. However, I think the gradient is then no longer an unbiased estimate.
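For reference, the way I understand the unbiasedness point (this is the standard score-function identity for Gaussian smoothing, not something specific to this repo):

$$\nabla_\theta\, \mathbb{E}_{\epsilon\sim\mathcal{N}(0,I)}\big[F(\theta+\sigma\epsilon)\big] \;=\; \frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\big[F(\theta+\sigma\epsilon)\,\epsilon\big],$$

so $\frac{1}{n\sigma}\sum_i F_i\epsilon_i$ is an unbiased estimate of the gradient of the smoothed objective, while the un-rescaled $\frac{1}{n}\sum_i F_i\epsilon_i$ estimates $\sigma$ times that gradient. Because $\sigma$ is a fixed hyperparameter, that constant factor can indeed be absorbed into the learning rate.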
According to the paper (page 3, Algorithm 2), the gradient in line 11 is rescaled by the standard deviation. However, I can't see this rescaling in the code at https://github.com/openai/evolution-strategies-starter/blob/master/es_distributed/es.py#L247.
What is the rationale behind this? Is it simply folded into the learning rate?
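In case it is useful, here is a minimal sketch of the two variants (non-antithetic, no fitness shaping, so it is a simplification of what the starter code actually does; the function and variable names are mine, not from es.py):

```python
import numpy as np

def es_gradient(F, theta, sigma=0.02, n=100, rescale_by_sigma=True, seed=0):
    """Plain ES gradient estimate of E_eps[F(theta + sigma*eps)] w.r.t. theta.

    rescale_by_sigma=True : g = 1/(n*sigma) * sum_i F(theta + sigma*eps_i) * eps_i
                            (the rescaled form written in the paper)
    rescale_by_sigma=False: g = 1/n * sum_i F(theta + sigma*eps_i) * eps_i
                            (the un-rescaled form, sigma times smaller)
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((n, theta.size))      # eps_i ~ N(0, I)
    returns = np.array([F(theta + sigma * e) for e in eps])
    g = (returns[:, None] * eps).mean(axis=0)       # (1/n) * sum_i F_i * eps_i
    return g / sigma if rescale_by_sigma else g

# Toy check on F(x) = -||x||^2: both variants point the same way,
# but the un-rescaled one is sigma (here 100x) smaller in magnitude.
F = lambda x: -np.sum(x ** 2)
theta = np.ones(5)
g_paper = es_gradient(F, theta, sigma=0.01, rescale_by_sigma=True)
g_code = es_gradient(F, theta, sigma=0.01, rescale_by_sigma=False)
print(np.linalg.norm(g_paper) / np.linalg.norm(g_code))  # 100.0 (same noise seed)
```

With identical noise, the two estimates differ only by the constant factor $1/\sigma$, which is why absorbing it into the stepsize works as long as $\sigma$ is never changed during training.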