hongzimao / pensieve

Neural Adaptive Video Streaming with Pensieve (SIGCOMM '17)
http://web.mit.edu/pensieve/
MIT License

Question about policy type and intrinsic discrepancy #115

Closed. wangyu92 closed this issue 3 years ago.

wangyu92 commented 3 years ago

Hi. There is something I don't understand while reading your code.

Here is a snippet of multi_agent.py:

action_cumsum = np.cumsum(action_prob)
bit_rate = (action_cumsum > np.random.randint(1, RAND_RANGE) / float(RAND_RANGE)).argmax()
# Note: we need to discretize the probability into 1/RAND_RANGE steps,
# because there is an intrinsic discrepancy in passing single state and batch states

You have implemented Pensieve using A2C (the synchronous, deterministic version of A3C) rather than A3C, right? So the agent has to deterministically choose the action, as written in the paper. However, the code above does not seem to select an action deterministically. I also want to know exactly what 'intrinsic discrepancy' means.

In summary: (1) Does Pensieve use a deterministic policy? (2) If the policy is stochastic, why not use code like distributions.Categorical(probs).sample()? (3) If the policy is deterministic, why does the action selection have randomness? (4) What does 'intrinsic discrepancy' mean?

Thank you.

hongzimao commented 3 years ago

Thank you for diving deeply into our codebase. About A2C (or A3C, and actually most RL algorithms in general): training-time action selection is not deterministic --- that's how RL fundamentally explores different policies. The determinism we may have described in the paper refers to inference time: in the case of A2C, you pick the action with the highest probability.
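
For concreteness, here is a minimal sketch of that distinction in plain numpy (this is not the actual Pensieve code; choose_action and the probability values are made up for illustration):

import numpy as np

def choose_action(action_prob, training=True):
    if training:
        # Training: sample from the policy's output distribution,
        # so the agent keeps exploring different actions.
        return int(np.random.choice(len(action_prob), p=action_prob))
    # Inference: act greedily on the most probable action,
    # i.e. the deterministic behavior referred to in the paper.
    return int(np.argmax(action_prob))

action_prob = np.array([0.05, 0.10, 0.20, 0.40, 0.15, 0.10])  # made-up 6-way bitrate distribution
print(choose_action(action_prob, training=True))   # varies from run to run
print(choose_action(action_prob, training=False))  # always 3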

About the code snippet, it's about fixing a nasty bug in an early version of TensorFlow. I don't think the bug exists nowadays, but if you are interested: when passing a single state into the policy network with shape [1, n] (n is the feature size), the output action distribution is slightly different from passing the same state in a batch of shape [m, n] (m is the batch size; we pass the state together with other states, especially during training). The difference between the two distributions --- the discrepancy --- is on the order of 1e-6. This creates a problem if the batch pass outputs a 0 probability for some action (i.e., never samples that action) but the single-state pass outputs 1e-6 (so the action is sampled with a small chance). If you run training for long enough, you will eventually sample an action that actually has 0 probability in the batch pass. This produces a NaN in the policy gradient and kills the training. It's a very subtle bug, but it's crucial to get right when training needs to run for a long time.
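
To illustrate the failure mode and the workaround, here is a rough sketch in plain numpy (the probability values are invented to mimic the ~1e-6 discrepancy; RAND_RANGE matches the constant in multi_agent.py):

import numpy as np

RAND_RANGE = 1000

def sample_discretized(action_prob):
    # Same trick as in multi_agent.py: compare the CDF against a threshold
    # that can only take the values k / RAND_RANGE for k = 1 .. RAND_RANGE - 1.
    # An action whose probability mass is on the order of the 1e-6 discrepancy
    # is far below the 1/RAND_RANGE step, so (as in this example) it
    # essentially never gets picked.
    action_cumsum = np.cumsum(action_prob)
    threshold = np.random.randint(1, RAND_RANGE) / float(RAND_RANGE)
    return int((action_cumsum > threshold).argmax())

# Invented numbers: the single-state pass leaks 1e-6 of mass onto the last
# action, while the batch pass assigns it exactly 0.
single_state_prob = np.array([0.299999, 0.7, 0.000001])
batch_prob = np.array([0.3, 0.7, 0.0])

a = sample_discretized(single_state_prob)  # picks action 0 or 1, never 2 here
# Had a non-discretized sampler picked action 2, the policy-gradient term
# log(pi(a|s)) computed from the batch pass would be log(0) = -inf -> NaN.
print(a, np.log(batch_prob[a]))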

About your detailed questions:

  1. Deterministic policy? --- only at inference time, if needed. Actually, you might want to consider using a stochastic policy at inference time as well, to create action trajectories similar to those seen during training so that the agent generalizes better (you should try both empirically)

  2. Why not distributions.Categorical(probs).sample()? --- when we developed Pensieve, this method didn't seem to exist :) (see the sketch after this list for what that step could look like today)

  3. see 1 and first paragraph

  4. see second paragraph
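
Regarding item 2 above, a rough sketch of what the sampling step could look like with libraries available today (plain numpy and PyTorch, shown only for illustration; neither snippet is part of the Pensieve codebase, and neither applies the 1/RAND_RANGE discretization discussed earlier):

import numpy as np
import torch

action_prob = np.array([0.2, 0.5, 0.3])

# NumPy: sample an action index directly from the probability vector.
a_np = int(np.random.choice(len(action_prob), p=action_prob))

# PyTorch: the distributions.Categorical(probs).sample() pattern from the question.
dist = torch.distributions.Categorical(probs=torch.tensor(action_prob))
a_torch = int(dist.sample())

print(a_np, a_torch)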

wangyu92 commented 3 years ago

Thank you very much for your kind reply. Your answer made everything very clear to me.