openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

Double tf.log()'s when calculating CategoricalPd.sample() ? #172

Closed (keven425 closed this issue 6 years ago)

keven425 commented 6 years ago

https://github.com/openai/baselines/blob/4993286230ac92ead39a66005b7042b56b8598b0/baselines/common/distributions.py#L155
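
For reference, the sample() method at that line is roughly the following (paraphrased, so it may not match the linked commit character for character):

```python
def sample(self):
    # u ~ Uniform(0, 1), same shape as the logits
    u = tf.random_uniform(tf.shape(self.logits))
    return tf.argmax(self.logits - tf.log(-tf.log(u)), axis=-1)
```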

It takes tf.log of u twice there. Why is that the case? I thought it should only take the log once, like this:

return tf.argmax(self.logits - tf.log(u), axis=-1)

@joschu

unixpickle commented 6 years ago

-log(-log(uniform)) follows the Gumbel distribution. Adding that Gumbel noise to the logits and taking the argmax is the Gumbel-max trick: it produces a sample from the categorical distribution defined by the logits using nothing but uniform random numbers.
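
A quick NumPy sanity check (just an illustrative sketch, not the TensorFlow code in baselines; the logits below are arbitrary) shows that argmax(logits - log(-log(u))) reproduces the softmax probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([1.0, 0.5, -1.0])           # arbitrary example logits

n = 200_000
u = rng.uniform(size=(n, logits.size))        # u ~ Uniform(0, 1)
gumbel = -np.log(-np.log(u))                  # Gumbel(0, 1) noise
samples = np.argmax(logits + gumbel, axis=-1) # Gumbel-max trick

# Empirical sampling frequencies should approach softmax(logits).
empirical = np.bincount(samples, minlength=logits.size) / n
softmax = np.exp(logits) / np.exp(logits).sum()
print(empirical)  # roughly [0.57, 0.35, 0.08]
print(softmax)    # ~[0.574, 0.348, 0.078]
```

With a single log you would instead get logits - log(u), i.e. exponential rather than Gumbel noise, and the argmax would no longer be distributed according to softmax(logits).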