sawcordwell / pymdptoolbox

Markov Decision Process (MDP) Toolbox for Python
BSD 3-Clause "New" or "Revised" License

MDP where not all actions are always available #25

Open jniediek opened 6 years ago

jniediek commented 6 years ago

In many relevant MDPs, some actions are unavailable in some states. An example is the "Recycling Robot" (Example 3.3 in Sutton & Barto). In short, a robot cannot recharge its battery when the battery is fully charged.

In pymdptoolbox, however, it's necessary to define state-transition probabilities for every action in every state. If an action is unavailable in a certain state, these probabilities have no meaning. Setting them to 0 is inadequate and causes errors, because each action's transition matrix needs row sums of 1.

Another possibility would be to assign a probability of 1 and reward of 0 (or less) to "remaining in the same state", but this seems a bit artificial, since the action isn't actually available.
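
For concreteness, here is roughly what that self-loop workaround would look like in pymdptoolbox's (A, S, S) format; the two-state battery example, rewards, and discount below are just placeholders I made up:

```python
import numpy as np
import mdptoolbox.mdp

# States: 0 = battery high, 1 = battery low
# Actions: 0 = search, 1 = recharge ("recharge" makes no sense in state 0)

# Transitions, shape (A, S, S); every row has to sum to 1.
P = np.array([
    # action 0: search
    [[0.7, 0.3],
     [0.4, 0.6]],
    # action 1: recharge; unavailable in state 0, so pad it with a self-loop
    [[1.0, 0.0],
     [1.0, 0.0]],
])

# Rewards, shape (S, A); the artificial self-loop gets reward 0 (or negative).
R = np.array([
    [2.0, 0.0],   # state 0: searching pays, "recharge" is the padded self-loop
    [1.0, 3.0],   # state 1: recharging is genuinely available here
])

vi = mdptoolbox.mdp.ValueIteration(P, R, 0.9)
vi.run()
print(vi.policy)
```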

Do you have any ideas how to circumvent this problem in a more elegant way?

ryanpeach commented 5 years ago

Another possibility would be to assign a probability of 1 and reward of 0 (or less) to "remaining in the same state", but this seems a bit artificial, since the action isn't actually available.

This is actually what you do when there are no actions available in a state, not when just some action is unavailable. For the "some action unavailable" case, you just set it to 0. For the "no actions available" case, you do exactly what you described.

luohc2004 commented 4 years ago

I understand @jniediek 's question well but can't really follow @ryanpeach 's idea. I think a soft constraint is obtainable by setting the rewards to a large negative value; nevertheless, that is different from a hard constraint. To illustrate @jniediek 's problem better, take a grid world as an example. In this grid world, the robot can move to an adjacent cell at each step, except from cells on the border or at the edge of an obstacle. Is there any way to hard-constrain the action set when the robot is in such an edge cell?

jniediek commented 4 years ago

I'm still interested in an answer ;)

ryanpeach commented 4 years ago

It’s been a while since I posted that, but this answer should still hold up.

Let’s say there are no possible actions in a certain state. The MDP should in this case specify that all actions result in staying in the same state with probability 1 and reward 0. Why reward 0? Because this is technically a sink state, so any infinite sequence of actions taken in this state keeps accumulating reward. If the reward is anything other than zero, the value becomes unbounded over infinite time (at least in the undiscounted case).

Now let’s say only some action is removed in a given state. Just set that action to loop back to the same state with probability 1 and some negative reward. Unfortunately, in this case the learner has to learn not to take it; the justification I like to use is that it “wastes time”.
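
To make both cases concrete, here is a rough helper I'd write around pymdptoolbox-style (A, S, S) matrices; the function name, the `available` dict, and the `penalty` value are entirely made up, not part of the toolbox:

```python
import numpy as np

def pad_unavailable_actions(P, R, available, penalty=-1.0):
    """Make (P, R) valid for every (state, action) pair.

    P: (A, S, S) transition probabilities (rows for unavailable actions can be anything).
    R: (A, S, S) rewards.
    available: dict mapping state -> set of actions that are really available there.
    Unavailable actions become self-loops: reward 0 if the state has no available
    action at all (a sink), otherwise `penalty` so the learner avoids them.
    """
    A, S, _ = P.shape
    P, R = P.copy(), R.copy()
    for s in range(S):
        acts = available.get(s, set())
        for a in range(A):
            if a not in acts:
                P[a, s, :] = 0.0
                P[a, s, s] = 1.0                       # stay in place
                R[a, s, :] = 0.0 if not acts else penalty
    return P, R
```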

There are ways to actually remove actions in given states, but they aren’t within the scope of MDPs themselves. For instance, you can define a mask over actions for every state; then any published algorithm over the MDP that applies a max, min, or sum over a state’s actions can be reasonably reinterpreted as applying the masked max, min, or sum instead. So if your agent argmaxes over a value function of actions, just use a masked argmax instead. The same goes for your iterative value-function updates. Set reasonable default values: for instance, the max, min, and sum over a fully masked action set should always be 0, because again these are sink cases.
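
For example, a masked variant of value iteration in plain numpy might look like this; it is only a sketch under the conventions above (the function and its arguments are mine, pymdptoolbox itself has no masking support):

```python
import numpy as np

def masked_value_iteration(P, R, mask, discount=0.95, tol=1e-8, max_iter=10000):
    """Value iteration with a masked max/argmax over actions.

    P:    (A, S, S) transition probabilities
    R:    (S, A) rewards
    mask: (S, A) boolean, True where the action is actually available
    States with no available action get the default value 0 (sink case).
    """
    A, S, _ = P.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Q[s, a] = R[s, a] + discount * sum_t P[a, s, t] * V[t]
        Q = R + discount * np.einsum('ast,t->sa', P, V)
        Q = np.where(mask, Q, -np.inf)                      # masked max
        V_new = np.where(mask.any(axis=1), Q.max(axis=1), 0.0)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)                               # masked argmax
    return V, policy
```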

Hope that helps and be sure to fact check me.