hardmaru / estool

Evolution Strategies Tool

PEPG questions #6

Closed cammckenzie closed 6 years ago

cammckenzie commented 6 years ago

Hey Hardmaru, thanks for creating your ES blogs; they've been really interesting.

I had a couple of quick and probably silly questions about your PEPG implementation.

I've read the original paper and attempted to implement the PEPG algorithm with symmetric sampling. I noticed a couple of differences from your implementation, and I was wondering if you could enlighten me.

In es.py where you're updating the mean, I notice that the calculated gradient does not get normalized by the batch size.

  rT = (reward[:self.batch_size] - reward[self.batch_size:])
  change_mu = np.dot(rT, epsilon)
  self.optimizer.stepsize = self.learning_rate
  update_ratio = self.optimizer.update(-change_mu) # adam, rmsprop, momentum, etc.

I guess that if you tune the learning rate according to batch size this is not an issue, but I was just wondering why you took this approach?
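For comparison, here is a minimal sketch (the shapes and variable names are assumptions for illustration, not code from es.py) showing that the unnormalized gradient and a batch-size-normalized one produce the same update once the learning rate is rescaled:

  import numpy as np

  batch_size, num_params = 16, 4
  learning_rate = 0.01
  reward = np.random.rand(2 * batch_size)            # fitness of the mirrored population
  epsilon = np.random.randn(batch_size, num_params)  # perturbations for the +/- pairs

  # Unnormalized gradient, as in the snippet above.
  rT = reward[:batch_size] - reward[batch_size:]
  change_mu = np.dot(rT, epsilon)

  # Batch-size-normalized variant: same direction, smaller magnitude.
  change_mu_normalized = change_mu / batch_size

  # The two give identical steps once the learning rate absorbs the factor.
  step_a = learning_rate * change_mu_normalized
  step_b = (learning_rate / batch_size) * change_mu
  assert np.allclose(step_a, step_b)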

Also, where you're making the symmetric samples:

self.epsilon = np.random.randn(self.batch_size, self.num_params) * self.sigma.reshape(1, self.num_params)

You're sampling from a uniform distribution. Is there a reason that you take this approach rather than sampling from a normal distribution? Also, and I may be completely wrong here, since the uniform distribution is drawn from [0, 1), doesn't this mean that your parameters will always be larger than the mean (for the + symmetric case) and always smaller than the mean (for the - symmetric case)? You won't end up with a mix of some parameters above the mean and some below.
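For reference, a minimal sketch of the symmetric (mirrored) sampling scheme being discussed; the variable names here are illustrative and not taken verbatim from es.py:

  import numpy as np

  batch_size, num_params = 8, 3
  mu = np.zeros(num_params)           # current mean of the search distribution
  sigma = 0.1 * np.ones(num_params)   # per-parameter standard deviations

  # epsilon ~ N(0, sigma^2): each entry can be positive or negative.
  epsilon = np.random.randn(batch_size, num_params) * sigma.reshape(1, num_params)

  # Mirrored population: every perturbation is evaluated in both directions,
  # so for each parameter some samples land above the mean and some below it.
  solutions_plus = mu + epsilon
  solutions_minus = mu - epsilon
  population = np.concatenate([solutions_plus, solutions_minus], axis=0)  # (2 * batch_size, num_params)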

I'm a completely self-taught beginner at this stuff, so apologies for the naive questions. Cheers

hardmaru commented 6 years ago

Hi @cammckenzie

Thanks for the message and interest.

For your first question, regarding normalizing by batch size: in the PEPG paper, on the right-hand side of "Algorithm 1" on page 7, which I tried to base the implementation on, there is no scaling by population size. I guess the difference can be absorbed by adjusting the learning rate.

For the second point, I am not sampling from a uniform distribution; I am sampling from a normal distribution using np.random.randn, as you mentioned.

self.epsilon = np.random.randn(self.batch_size, self.num_params) * self.sigma.reshape(1, self.num_params)
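A quick standalone check (illustration only, not part of es.py) shows that np.random.randn draws from a standard normal, so the perturbation entries come out with both signs:

  import numpy as np

  samples = np.random.randn(1000000)
  print(samples.mean())        # close to 0.0
  print((samples < 0).mean())  # close to 0.5, i.e. roughly half the draws are negative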

Thanks

cammckenzie commented 6 years ago

Thanks for the quick reply @hardmaru, much appreciated.

Regarding point 1, I appear to have misinterpreted the paper. I did notice that the algorithm doesn't appear to have any scaling for the standard deviation term either, but your implementation seems to scale it by the batch size:

delta_sigma = (np.dot(rS, S)) / (2 * self.batch_size * stdev_reward)
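For context, here is a hedged sketch of that sigma update following Algorithm 1 of the PEPG paper; the baseline b, reward_avg, and the exact names are assumptions based on the paper rather than a verbatim copy of es.py:

  import numpy as np

  batch_size, num_params = 8, 3
  sigma = 0.1 * np.ones(num_params)
  epsilon = np.random.randn(batch_size, num_params) * sigma.reshape(1, num_params)
  reward = np.random.rand(2 * batch_size)  # fitness of the mirrored population
  b = reward.mean()                        # reward baseline
  stdev_reward = reward.std()

  # S from the PEPG paper: derivative of the log-density with respect to sigma.
  S = (epsilon * epsilon - (sigma * sigma).reshape(1, num_params)) / sigma.reshape(1, num_params)

  # Average reward of each mirrored pair, measured against the baseline.
  reward_avg = (reward[:batch_size] + reward[batch_size:]) / 2.0
  rS = reward_avg - b

  # Sigma gradient, normalized by batch size and by the reward spread.
  delta_sigma = np.dot(rS, S) / (2 * batch_size * stdev_reward)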

Anyway, ultimately, I don't think it's very important. Was there another paper that had the additional tricks you implemented (annealing learning rates, etc.), or did you just pull them from standard machine learning approaches?

You're quite right on point 2; I'm not sure how I interpreted that as a uniform sample. Cheers

hardmaru commented 6 years ago

You're right, I incorporated scaling by the batch (or population) size so that the same learning rate parameter can be used across different population sizes (I think the paper should have done that too; it's just an obvious thing to do). The annealing and other tricks are standard tricks from deep learning. I recommend going through OpenAI's ES paper as well; they used Adam rather than vanilla SGD, for instance.
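For readers following along, here is a minimal standalone sketch of feeding a batch-size-scaled gradient into Adam rather than vanilla SGD, in the spirit of the OpenAI ES paper; this Adam class is purely illustrative and is not the Optimizer implementation from es.py:

  import numpy as np

  class Adam:
      # Minimal Adam optimizer, for illustration only.
      def __init__(self, num_params, stepsize=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
          self.m = np.zeros(num_params)
          self.v = np.zeros(num_params)
          self.t = 0
          self.stepsize, self.beta1, self.beta2, self.eps = stepsize, beta1, beta2, eps

      def step(self, grad):
          # Returns the parameter change for one descent step on grad.
          self.t += 1
          self.m = self.beta1 * self.m + (1 - self.beta1) * grad
          self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
          m_hat = self.m / (1 - self.beta1 ** self.t)
          v_hat = self.v / (1 - self.beta2 ** self.t)
          return -self.stepsize * m_hat / (np.sqrt(v_hat) + self.eps)

  # Usage: estimate the gradient, scale by batch size, then let Adam adapt the step.
  batch_size, num_params = 16, 4
  mu = np.zeros(num_params)
  opt = Adam(num_params)
  reward = np.random.rand(2 * batch_size)
  epsilon = np.random.randn(batch_size, num_params) * 0.1
  rT = reward[:batch_size] - reward[batch_size:]
  change_mu = np.dot(rT, epsilon) / batch_size  # batch-size-scaled gradient estimate
  mu += opt.step(-change_mu)                    # descending on -gradient ascends the reward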

cammckenzie commented 6 years ago

Thanks again @hardmaru, looking forward to your next blog, whatever that may be.