hongzimao / decima-sim

Learning Scheduling Algorithms for Data Processing Clusters
https://web.mit.edu/decima/

Questions about the node_acts & job_acts selection. #19

Closed · jiahy0825 closed 4 years ago

jiahy0825 commented 4 years ago

https://github.com/hongzimao/decima-sim/blob/c010dd74ff4b7566bd0ac989c90a32cfbc630d84/actor_agent.py#L73-L80

In my opinion, self.node_act_probs in your code represents the probability of selecting each node, and the noise is then used to explore the action space of the reinforcement-learning problem.

However, after working through your code, I get $\text{node\_acts} = \arg\max\left(\frac{p}{-\log(\text{noise})}\right)$, where $p \in [0, 1]$ is the probability of selecting each node and $\text{noise} \in (0, 1)$ is drawn from a uniform distribution. But since $-\log(\text{noise}) \in (0, +\infty)$, it seems that after this calculation the choice of the next node may be completely random.

Maybe my understanding is wrong, but how should I understand this implementation?

BTW, have you considered using an $\epsilon$-greedy policy to explore the action space? If so, why did you decide against it?

hongzimao commented 4 years ago

Hi, thank you for carefully examining our code. This part of the action sampling is indeed tricky to understand. We learned it from OpenAI's implementation. It's a mathematical trick called the Gumbel-max trick. Detailed explanations can be found at http://amid.fish/humble-gumbel, https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ and https://github.com/openai/baselines/issues/358
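To make the trick concrete: adding independent Gumbel noise $-\log(-\log(U))$, with $U \sim \text{Uniform}(0,1)$, to the log-probabilities and taking the argmax produces an exact sample from the categorical distribution. Here is a minimal, self-contained NumPy sketch (not the repo's code) you can run to check this:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.2, 0.5, 0.3])  # categorical action probabilities

def gumbel_max_sample(p, rng):
    # U ~ Uniform(0, 1); -log(-log(U)) is standard Gumbel noise.
    noise = rng.uniform(size=p.shape)
    return np.argmax(np.log(p) - np.log(-np.log(noise)))

# Empirically the sample frequencies match p, so the result is not
# "completely random" -- the noise is shaped exactly so that the
# argmax recovers the categorical distribution.
samples = [gumbel_max_sample(p, rng) for _ in range(100_000)]
print(np.bincount(samples) / len(samples))  # ~ [0.2, 0.5, 0.3]
```

Note that $\arg\max(\log p - \log(-\log U))$ is exactly the $\arg\max\left(\frac{p}{-\log(U)}\right)$ expression you derived, so higher-probability nodes win the argmax proportionally more often.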

About epsilon-greedy: note that the vanilla policy-gradient family (e.g., A2C) is on-policy (it has to use data sampled from the current policy). Epsilon-greedy creates a bias in the data, because sometimes the action is sampled at random rather than from the current policy. You would need a correction, such as importance sampling, to make the training data unbiased for policy gradient. To avoid all this complication, the standard way to explore with policy gradient is to increase the entropy of the action distribution and let the random sampling naturally explore. Hope these help!
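If it helps, here is a rough sketch of the entropy-bonus idea (illustrative only, not Decima's exact loss; entropy_weight is a made-up name here). Subtracting a scaled entropy term from the policy-gradient loss discourages the optimizer from collapsing the policy into a deterministic one, so on-policy sampling keeps exploring:

```python
import numpy as np

def pg_loss_with_entropy(probs, action, advantage, entropy_weight=0.01):
    # Vanilla policy-gradient term: -log pi(a|s) * advantage.
    pg_term = -np.log(probs[action]) * advantage
    # Entropy of the action distribution; subtracting it from the loss
    # rewards keeping the policy stochastic, which is what replaces
    # epsilon-greedy exploration in the on-policy setting.
    entropy = -np.sum(probs * np.log(probs))
    return pg_term - entropy_weight * entropy

probs = np.array([0.2, 0.5, 0.3])
print(pg_loss_with_entropy(probs, action=1, advantage=1.5))
```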

hongzimao commented 4 years ago

btw, I think TensorFlow now supports something like distribution.categorical.sample(). This should be simple to use and robust enough for this application. When we developed Decima, this option did not exist (I think).
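Something along these lines, assuming the TF 1.x tf.distributions.Categorical API (double-check against the TF version you use):

```python
import tensorflow as tf  # TF 1.x style

node_act_probs = tf.constant([[0.2, 0.5, 0.3]])
dist = tf.distributions.Categorical(probs=node_act_probs)
node_acts = dist.sample()  # one sampled action index per batch row

with tf.Session() as sess:
    print(sess.run(node_acts))
```

This samples directly from the action distribution, which is equivalent to what the Gumbel-max code does by hand.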

jiahy0825 commented 4 years ago

Thank you for your quick and thorough reply, it fully resolved my doubts!