openai / spinningup

An educational resource to help anyone learn deep reinforcement learning.
https://spinningup.openai.com/
MIT License

RL intro documentation comment #202

Open rbahumi opened 4 years ago

rbahumi commented 4 years ago

Hi,

Thanks for this great resource.

I have a comment regarding a statement in rl_intro.rst. Here is the relevant paragraph from the docs:

> There is always an optimal policy which deterministically selects an action.

What about cases in which the optimal policy is stochastic, like in rock-paper-scissors? In that case there is no optimal deterministic policy, and the optimal stochastic policy (choose an action uniformly at random) also cannot be trivially learned with a value-method algorithm.
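For concreteness, here is a toy sketch (mine, not from the docs) of why the uniform mix is the only safe policy in rock-paper-scissors: any deterministic choice has a pure counter-strategy, while the uniform mix has none.

```python
import numpy as np

# Row player's payoff matrix for rock-paper-scissors.
# Rows = my action, columns = opponent's action, order = [rock, paper, scissors].
PAYOFF = np.array([
    [ 0, -1,  1],   # rock:     ties rock, loses to paper, beats scissors
    [ 1,  0, -1],   # paper:    beats rock, ties paper, loses to scissors
    [-1,  1,  0],   # scissors: loses to rock, beats paper, ties scissors
])

def best_response_value(opponent_mix):
    """Expected payoff of the best pure counter-action to a fixed opponent mix."""
    return (PAYOFF @ np.asarray(opponent_mix)).max()

print(best_response_value([1/3, 1/3, 1/3]))    # 0.0 -> the uniform mix cannot be exploited
print(best_response_value([1.0, 0.0, 0.0]))    # 1.0 -> "always rock" loses to "always paper"
```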

parvkpr commented 4 years ago

Hey @rbahumi, you're right about the optimal stochastic policy bit. This is why "exploration" is used in reinforcement learning. As you might have noticed, large-scale problems incorporate some sort of exploration technique suited to the environment. It can be explicit (an epsilon-greedy policy, which takes a random action with probability epsilon) or implicit (reward engineering). Hope this clarifies your issue.
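For reference, explicit exploration in the epsilon-greedy sense is just the following (a minimal sketch of mine, with illustrative names and values):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    """Explicit exploration: with probability epsilon pick a uniformly random
    action, otherwise act greedily on the current action-value estimates."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))   # explore
    return int(np.argmax(q_values))               # exploit

# Illustrative Q estimates for three actions
action = epsilon_greedy(np.array([0.2, 0.5, 0.1]), epsilon=0.1)
```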

joshua-dean commented 4 years ago

Considering the rock-paper-scissors case, is a random action any better than choosing a single action repeatedly? Assuming the environment acts randomly with an even distribution and any observation given to the policy is meaningless, choosing the same action every time in a deterministic manner yields a 1/3 success rate, the same as choosing an action at random. If the environment gives a nontrivial observation, or the opponent's selections are not evenly distributed, that can be learned to produce a deterministic policy that outperforms a random one. So unless I'm missing something, there's no case where the above statement isn't true: for any even distribution a single fixed action is optimal, and uneven distributions can still be exploited deterministically.
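Quick sanity check of that claim (my own sketch, assuming a uniformly random opponent and no useful observation):

```python
import random

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # what each action beats

def win_rate(policy, episodes=100_000, seed=0):
    """Empirical win rate against an opponent that plays uniformly at random."""
    rng = random.Random(seed)
    wins = sum(BEATS[policy(rng)] == rng.choice(ACTIONS) for _ in range(episodes))
    return wins / episodes

always_rock = lambda rng: "rock"           # deterministic policy
uniform = lambda rng: rng.choice(ACTIONS)  # stochastic policy

print(win_rate(always_rock))  # ~0.333
print(win_rate(uniform))      # ~0.333
```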

Additionally, if I recall correctly any non-deterministic algorithm can be simulated by a deterministic one.

rbahumi commented 4 years ago

> Considering the rock-paper-scissors case, is a random action any better than choosing a single action repeatedly? Assuming the environment acts randomly with an even distribution and any observation given to the policy is meaningless, choosing the same action every time in a deterministic manner yields a 1/3 success rate, the same as choosing an action at random. If the environment gives a nontrivial observation, or the opponent's selections are not evenly distributed, that can be learned to produce a deterministic policy that outperforms a random one. So unless I'm missing something, there's no case where the above statement isn't true: for any even distribution a single fixed action is optimal, and uneven distributions can still be exploited deterministically.
>
> Additionally, if I recall correctly any non-deterministic algorithm can be simulated by a deterministic one.

@Alder85 when I was talking about rock-paper-scissors, I meant a game against an intelligent opponent that tries to maximize its reward and dynamically changes its policy. In that case, if one player chooses any strategy other than picking each action uniformly at random with probability 1/3, the other player can exploit it. Playing each action with probability 1/3 is the Nash equilibrium of this game.
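One way to make that concrete (a toy sketch of mine, not anything from the docs): play against a naive adaptive opponent that best-responds to the empirical frequencies of your past actions. Any deterministic policy gets punished; the uniform mix does not.

```python
import random
from collections import Counter

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
WHAT_BEATS = {beaten: winner for winner, beaten in BEATS.items()}  # counter to each action

def average_score(my_policy, episodes=50_000, seed=0):
    """Average (win - loss) per round against an opponent that best-responds
    to the empirical frequencies of my past actions."""
    rng = random.Random(seed)
    counts = Counter(ACTIONS)                              # uniform prior over my actions
    score = 0
    for _ in range(episodes):
        theirs = WHAT_BEATS[counts.most_common(1)[0][0]]   # counter my most frequent action
        mine = my_policy(rng)
        counts[mine] += 1
        score += (BEATS[mine] == theirs) - (BEATS[theirs] == mine)
    return score / episodes

print(average_score(lambda rng: "rock"))               # ~ -1.0: fully exploited
print(average_score(lambda rng: rng.choice(ACTIONS)))  # ~  0.0: unexploitable
```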

rbahumi commented 4 years ago

> Hey @rbahumi, you're right about the optimal stochastic policy bit. This is why "exploration" is used in reinforcement learning. As you might have noticed, large-scale problems incorporate some sort of exploration technique suited to the environment. It can be explicit (an epsilon-greedy policy, which takes a random action with probability epsilon) or implicit (reward engineering). Hope this clarifies your issue.

@parvkpr exploration is not relevant here. Exploration is used to find the optimal policy. In the case of rock-paper-scissors, the optimal policy is known to be stochastic, and that is something a value method cannot learn. In action-value methods, the policy is defined implicitly by choosing the action with the maximal action-value estimate. This inherent maximization makes it impossible for action-value methods to naturally learn a distribution over actions.
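To spell out that last point with a toy sketch (mine, purely illustrative):

```python
import numpy as np

# In rock-paper-scissors against the equilibrium opponent, every action has the
# same expected return, so the true action values are all equal.
q_values = np.array([0.0, 0.0, 0.0])

def implied_policy(q):
    """The policy an action-value method acts with: all probability mass on
    argmax(q). It can only ever represent deterministic policies."""
    probs = np.zeros_like(q)
    probs[np.argmax(q)] = 1.0
    return probs

print(implied_policy(q_values))  # [1. 0. 0.] -- deterministic (and exploitable),
                                 # never the optimal 1/3-1/3-1/3 mix
```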

joshua-dean commented 4 years ago

@rbahumi I didn't fully consider that case. Given that the opponent adapts in a random or non-random fashion and only provides trivial observations, it does make sense that a basic policy could not learn the optimal strategy. However, and I'm not completely sure on this: wouldn't an actor that considers the history of its own actions (such as an RNN or LSTM) still be able to achieve optimal performance in this case? If the opponent adapts in a random fashion, the desired output is an approximately even distribution across actions, which, given a sufficiently large memory of previous actions, is something a deterministic policy should still be capable of producing. If the opponent adapts in a non-random fashion (meaning there is any way at all to model how the opponent will adapt from the available information), a deterministic policy can learn that and then outperform an even stochastic distribution. I'm willing to believe that for the first case to be optimal the memory size may have to approach infinity for some problems (although for most use cases a sufficiently large finite amount is probably 'good enough'), but unless I'm missing something, I still think an optimal deterministic policy is technically achievable.
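As a deliberately simple sketch of that idea (mine, assuming the only input is the agent's own action history): a deterministic function of the history can keep its empirical action frequencies roughly uniform.

```python
from collections import Counter

ACTIONS = ["rock", "paper", "scissors"]

def history_policy(history):
    """A deterministic function of the agent's own action history: always play
    the action used least so far, so empirical frequencies stay roughly uniform."""
    counts = Counter({a: 0 for a in ACTIONS})
    counts.update(history)
    return min(ACTIONS, key=lambda a: counts[a])

history = []
for _ in range(9999):
    history.append(history_policy(history))

print(Counter(history))  # 3333 of each action
```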