qzed / irl-maxent

Maximum Entropy and Maximum Causal Entropy Inverse Reinforcement Learning Implementation in Python

Multiple terminal states #3

Open siddhya opened 1 year ago

siddhya commented 1 year ago

Thanks a lot for such well-documented code. It is making it really easy for me to adapt for my use case.

My MDP has multiple terminal states and I was wondering how to change the `local_action_probabilities()` code for that scenario. The Ziebart paper mentions you have to do it for all terminal states, but I'm not sure how to combine them. Thanks for your help!

qzed commented 1 year ago

I think you don't need to change the code. It should work with a list of states thanks to numpy's indexing.

The general idea is that you set the partition value of all terminal states to one, i.e. $$\forall s_i \in S_{\mathrm{terminal}} : Z_{s_i, t=0} = 1.$$ Then you recursively back up from those terminal states by first computing $Z_{a_{i,j}, t}$ from $Z_{s_i, t-1}$ and then $Z_{s_i, t}$ from $Z_{a_{i,j}, t}$. So the back-up process is essentially the same.
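
For illustration, here is a minimal sketch of that backward pass (following the paper's backward-pass algorithm rather than the exact library code; shapes and names are illustrative):

```python
import numpy as np

# Minimal sketch of the MaxEnt backward pass (illustrative, not the exact
# library code). `p_transition` is assumed to have shape
# (n_states, n_states, n_actions), `reward` shape (n_states,), and `terminal`
# can be a single index or a list/array of indices.
def backward_pass_sketch(p_transition, reward, terminal, n_iterations):
    n_states, _, n_actions = p_transition.shape
    er = np.exp(reward)

    # initialize the state partition function at the terminal states;
    # NumPy's fancy indexing accepts a list here, so multiple terminal
    # states need no special treatment
    zs = np.zeros(n_states)
    zs[terminal] = 1.0

    for _ in range(n_iterations):
        # back up from states to state-action pairs ...
        za = np.array([er * (p_transition[:, :, a] @ zs) for a in range(n_actions)]).T
        # ... and from state-action pairs back to states
        zs = za.sum(axis=1)
        zs[terminal] += 1.0

    # local action probabilities P(a | s) = Z_a / Z_s
    return za / zs[:, None]
```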

siddhya commented 1 year ago

I am using MaxEnt for the Iterated Prisoner's Dilemma. It is an unending game, so I limited it to n iterations; that means any state can be a terminal state. I will try adding all the states to the list, thanks!

qzed commented 1 year ago

Ah, interesting. I haven't worked with this (and IRL in general) for a while, so please be aware that I might not be the best person to answer theoretical questions about this.

However, I think that initializing all states to one and then iterating for some fixed number of steps should theoretically work and be a sound approximation to the unending game. There could be a technical issue in that the $Z_{s_i}$ and $Z_{a_{i,j}}$ will "run off" towards infinity if $\exp(R(s_i)) > 1$, or towards zero if $\exp(R(s_i)) < 1$, when you iterate for a long time. In the end, what we're really interested in is $$P(a_{i,j} \mid s_i) = \frac{Z_{a_{i,j}}}{Z_{s_i}},$$ which is a relative quantity. So if that becomes a problem, you could try, after each iteration, to find some scalar $c$ (maybe max/min/avg) and divide all $Z_{s_i}$ and $Z_{a_{i,j}}$ by it (but maybe there are better options if you think about it a bit longer than I did).
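
As a tiny (hypothetical) illustration of that rescaling idea, something like this could be called once per backward-pass iteration:

```python
import numpy as np

def rescale(zs, za):
    # divide both partition arrays by a common positive scalar (here the max
    # of zs) so that the ratio Z_a / Z_s, i.e. the local action probability,
    # is unchanged while the raw magnitudes stay bounded
    c = np.max(zs)
    if c > 0.0:
        return zs / c, za / c
    return zs, za
```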

Instead of that, however, you probably want to consider using maximum causal entropy IRL (see `local_causal_action_probabilities()`). The theoretical foundation there is based on "soft" Bellman equations (although I'm honestly not sure I can give you the full theoretical derivation any more), due to which you essentially compute action- and state-value functions, Q and V. The nice thing is that the state-value function V is now computed with a softmax. That means it doesn't run off and, ideally, converges. Now, I'm not sure if it will converge to a single solution in your endless case, so you might need to change the `while delta > eps` loop to something else.
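
Roughly, that "soft" value iteration looks like this (a sketch of the idea behind `local_causal_action_probabilities()`, not its exact code; the `discount` parameter and the array shapes are assumptions):

```python
import numpy as np
from scipy.special import logsumexp

# Sketch of the soft Bellman backup behind maximum causal entropy IRL
# (illustrative only). Shapes as before: p_transition (n_states, n_states,
# n_actions), reward (n_states,).
def soft_value_iteration_sketch(p_transition, reward, discount, eps=1e-5):
    n_states, _, n_actions = p_transition.shape
    v = np.zeros(n_states)

    delta = np.inf
    while delta > eps:  # for the unending game a fixed number of sweeps may be the better stopping rule
        v_old = v
        # Q(s, a) = R(s) + discount * E[V(s')], then V(s) = softmax_a Q(s, a),
        # i.e. a log-sum-exp instead of the hard max of regular value iteration
        q = np.array([reward + discount * (p_transition[:, :, a] @ v_old)
                      for a in range(n_actions)]).T
        v = logsumexp(q, axis=1)
        delta = np.max(np.abs(v - v_old))

    # local (causal) action probabilities
    return np.exp(q - v[:, None])
```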

Also, for the causal function you'll have to specify the terminal reward function yourself (because of this check), but you can just set it to zero everywhere and it should work fine (i.e. `terminal = np.zeros(n_states)`).

siddhya commented 1 year ago

Thanks. As a first step (to verify the new world environment I added) I set one of the states as the terminal state. That worked fairly well, and the recovered reward function was close to the original.

After that I tried the infinite-horizon case. For that I commented out `p_transition[terminal, :, :] = 0.0` in `expected_svf_from_policy()` and also changed the loop to `for _ in range(2 * n_states)`. But the recovered reward function is not correct for either maxent or maxent_causal.
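
Roughly, the modified forward computation now looks like this (sketched from memory, not the exact repository code):

```python
import numpy as np

# Sketch of the modified expected-SVF computation. Shapes assumed:
# p_transition (n_states, n_states, n_actions), p_initial (n_states,),
# p_action (n_states, n_actions).
def expected_svf_sketch(p_transition, p_initial, p_action, n_iterations):
    n_states, _, n_actions = p_transition.shape

    # the masking line `p_transition[terminal, :, :] = 0.0` is left out here,
    # so probability mass keeps flowing out of every state
    d = np.zeros(n_states)
    for _ in range(n_iterations):  # e.g. 2 * n_states, replacing `while delta > eps`
        d_next = [p_transition[:, :, a].T @ (p_action[:, a] * d) for a in range(n_actions)]
        d = p_initial + np.sum(d_next, axis=0)

    return d
```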

At this point I am not sure if it is a question of tweaking some parameters or I have got something fundamentally wrong.

qzed commented 1 year ago

Sorry for the late response. I think this suggests that the expected SVF doesn't converge to something meaningful, which is probably more of a theoretical or algorithmic issue than just a matter of tweaking parameters. Regarding convergence, Ziebart et al. write:

> Given parameter weights, the partition function, Z(θ), always converges for finite horizon problems and infinite horizons problems with discounted reward weights. For infinite horizon problems with zero-reward absorbing states, the partition function can fail to converge even when the rewards of all states are negative.

There is a paper (https://arxiv.org/abs/2012.00889) proposing an algorithm that doesn't explicitly rely on terminal states (and has some other improvements), so maybe that would work better here.