Closed: jiahy0825 closed this issue 4 years ago.
Hi, happy birthday! Thank you for carefully examining our code. This part of the action sampling is indeed tricky to understand. We learned it from OpenAI's implementation; it's a mathematical technique called the Gumbel-max trick. Detailed explanations can be found at http://amid.fish/humble-gumbel, https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/, and https://github.com/openai/baselines/issues/358
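To make the trick concrete, here is a minimal NumPy sketch (the function name and the example probabilities are just illustrative, not from the repo): taking the argmax of the log-probabilities plus Gumbel noise, where the Gumbel noise is $-\log(-\log(U))$ with $U \sim \text{Uniform}(0, 1)$, is equivalent to sampling directly from the categorical distribution.

```python
import numpy as np

def gumbel_max_sample(probs, rng):
    """Draw one index from a categorical distribution via the Gumbel-max trick."""
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    uniform = rng.uniform(size=len(probs))
    gumbel_noise = -np.log(-np.log(uniform))
    # argmax of log-probabilities + Gumbel noise is a sample from Categorical(probs)
    return np.argmax(np.log(probs) + gumbel_noise)

# Sanity check: empirical frequencies should roughly match `probs`
probs = np.array([0.1, 0.2, 0.7])
rng = np.random.default_rng(0)
samples = [gumbel_max_sample(probs, rng) for _ in range(100_000)]
print(np.bincount(samples, minlength=len(probs)) / len(samples))  # ~ [0.1, 0.2, 0.7]
```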
Regarding epsilon-greedy: note that the vanilla policy-gradient family (e.g., A2C) is on-policy, i.e., it has to use data sampled from the current policy. Epsilon-greedy biases that data, because the action is sometimes sampled at random rather than from the current policy, so you would need a correction such as importance sampling to keep the policy-gradient estimate unbiased. To avoid this complication, the standard way to explore with policy gradients is to increase the entropy of the action distribution and let the random sampling explore naturally. Hope this helps!
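To illustrate the entropy-based exploration, here is a generic A2C-style sketch (not the exact Decima code; the function name and the entropy weight are made up): a small entropy bonus is subtracted from the policy-gradient loss, so the optimizer is rewarded for keeping the action distribution spread out.

```python
import numpy as np

def pg_loss_with_entropy_bonus(action_probs, chosen_actions, advantages,
                               entropy_weight=0.01, eps=1e-8):
    """Policy-gradient loss with an entropy bonus (generic A2C-style sketch).

    action_probs:   [batch, num_actions] probabilities from the current policy
    chosen_actions: [batch] indices of the actions actually taken
    advantages:     [batch] advantage estimates
    """
    batch_idx = np.arange(len(chosen_actions))
    log_pi = np.log(action_probs[batch_idx, chosen_actions] + eps)
    # REINFORCE/A2C term: maximize advantage-weighted log-probability
    pg_loss = -np.mean(log_pi * advantages)
    # entropy of the action distribution; subtracting it encourages exploration
    entropy = -np.mean(np.sum(action_probs * np.log(action_probs + eps), axis=1))
    return pg_loss - entropy_weight * entropy
```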
BTW, I think TensorFlow now supports something like `distribution.categorical.sample()`. That should be simple to use and robust enough for this application. When we developed Decima, this option did not exist (I think).
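For example, something along these lines (assuming TensorFlow 2; `tf.random.categorical` takes logits, so the probabilities are passed through a log first, and the exact call in a current project may differ):

```python
import tensorflow as tf

# node_act_probs: [batch_size, num_nodes] probabilities from the policy network
node_act_probs = tf.constant([[0.1, 0.2, 0.7],
                              [0.5, 0.3, 0.2]])

# tf.random.categorical expects logits, so take the log of the probabilities
node_acts = tf.random.categorical(tf.math.log(node_act_probs), num_samples=1)
print(node_acts)  # shape [batch_size, 1], one sampled node index per row
```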
Thank you for your quick reply, which resolved my doubts very well!
https://github.com/hongzimao/decima-sim/blob/c010dd74ff4b7566bd0ac989c90a32cfbc630d84/actor_agent.py#L73-L80
In my opinion, `self.node_act_probs` in your code represents the probability of selecting each node, and then you use `noise` to explore the action space of the reinforcement learning problem. However, after reformulating your code, I get $\text{node\_acts} = \arg\max\left(\frac{p}{-\log(\text{noise})}\right)$, where $p \in [0, 1]$ is the probability of selecting each node and $\text{noise} \in (0, 1)$ follows a uniform distribution. But we know that $-\log(\text{noise}) \in (0, +\infty)$, so after the above calculation it seems the choice of the next node might be completely random.
Maybe my understanding is wrong, but how should I understand this implementation?
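For reference, here is a quick numerical experiment with that expression (the probabilities are just made-up numbers, not from the repo); empirically the selection frequencies seem to track $p$ rather than being uniform, but I would still like to understand why.

```python
import numpy as np

p = np.array([0.1, 0.2, 0.7])             # probability of selecting each node
counts = np.zeros(len(p))
rng = np.random.default_rng(0)
for _ in range(100_000):
    noise = rng.uniform(size=len(p))       # noise in (0, 1)
    counts[np.argmax(p / (-np.log(noise)))] += 1
print(counts / counts.sum())               # roughly [0.1, 0.2, 0.7], not uniform
```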
BTW, have you considered using an $\epsilon$-greedy policy to explore the action space? If you did consider it, why did you decide not to use it?