vojtamolda / reinforcement-learning-an-introduction

Solutions to exercises in Reinforcement Learning: An Introduction (2nd Edition).

Exercise 4.2 & 4.6 #8

Closed hhoppe closed 2 years ago

hhoppe commented 2 years ago

For the last part of the question, this is my intuition: because the newly introduced state 15 has the same value (-20.0) as state 13, modifying state 13 so that its down action transitions to state 15 instead of to itself (state 13) should have no effect on the value of state 13. In other words, applying equation 4.5 to state 13 yields a new value v_{k+1}(s_13) identical to v_k(s_13), because the set of values v_k(s') is unchanged (even though one of the s' is different).
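
Concretely, plugging in the converged values from Figure 4.1 of the book (v_k(12) = -22, v_k(14) = -14, v_k(9) = -20) together with v_k(15) = v_k(13) = -20 from the first part, equation 4.5 under the equiprobable random policy reduces to:

```latex
v_{k+1}(13) = -1 + \tfrac{1}{4}\bigl(v_k(12) + v_k(14) + v_k(9) + v_k(15)\bigr)
            = -1 + \tfrac{1}{4}(-22 - 14 - 20 - 20) = -20 = v_k(13)
```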

Also, a nitpick on exercise 4.6: in the upper branch, \pi(a, s) should equal 1 - \epsilon rather than 1 - \epsilon |A_s|.

vojtamolda commented 2 years ago

So first of all, let's get the easier things out of the way right away. Thanks for "nitpicking" :) You're right about 4.6. There's a typo that leaves the new policy's probabilities unnormalized.

vojtamolda commented 2 years ago

I agree with your answer to the first part, where no transition leads into the new state 15. The values of states 13 and 15 will be equal.

I'm not sure your intuition is correct for the second part of the question. Can you please provide some math to justify it?

I think you assume that the value of state 13 stays the same even after introducing the new transition, and start your argument from there. But I don't think this is true. Introducing an action that takes you from 13 to 15 will have an effect on the value of 13, and I don't think it's easy to say by how much using intuition alone. Introducing the new state and transition changes the entire MDP, and that has an effect on the values of all states.

My answer is a bit lazy and assumes the values of the "2nd neighbors", i.e. states 8, 9 and 10, stay the same. This, again, isn't strictly true, but the change hopefully won't be very large, so it should give a good enough approximate answer. This way I also only have to solve a 4x4 linear system for the new values of states 12, 13, 14 and 15.
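
For reference, a rough standalone sketch of that 4x4 system (not using mdp.py; the frozen values v_8 = -20, v_9 = -20, v_10 = -18 come from Figure 4.1, and each matrix row just rearranges the Bellman equation v(s) = -1 + 1/4 * sum of successor values):

```python
import numpy as np

# Frozen "2nd neighbor" values, taken from Figure 4.1.
v8, v9, v10 = -20.0, -20.0, -18.0

# Unknowns x = [v12, v13, v14, v15]; each row rearranges
# v(s) = -1 + 1/4 * (sum of successor values) into A @ x = b.
A = np.array([
    [ 0.50, -0.25,  0.00,  0.00],  # v12: left/down bounce back, up -> 8, right -> 13
    [-0.25,  1.00, -0.25, -0.25],  # v13: left -> 12, right -> 14, up -> 9, down -> 15 (new)
    [ 0.00, -0.25,  0.75,  0.00],  # v14: left -> 13, right -> terminal, up -> 10, down bounces
    [-0.25, -0.25, -0.25,  0.75],  # v15: left -> 12, up -> 13, right -> 14, down bounces
])
b = np.array([
    -1 + 0.25 * v8,   # v12
    -1 + 0.25 * v9,   # v13
    -1 + 0.25 * v10,  # v14
    -1.0,             # v15 has no frozen neighbor
])

print(np.linalg.solve(A, b))  # approximate new values of states 12, 13, 14, 15
```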

I think that to get the precise value of every state in the new situation with state 15 (and, in fact, to settle our argument), one has to actually re-solve the MDP from scratch using a policy or value iteration algorithm, like the authors do in Example 4.1.
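
Something along these lines would do it: a from-scratch sketch (standalone, not using mdp.py) of iterative policy evaluation for the equiprobable random policy, as in Example 4.1, with the new state 15 and the modified "down" action of state 13 added:

```python
import numpy as np

GRID = np.array([[ 0,  1,  2,  3],
                 [ 4,  5,  6,  7],
                 [ 8,  9, 10, 11],
                 [12, 13, 14,  0]])          # 0 marks the terminal state
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def successors(state):
    """The four successor states of a non-terminal state."""
    if state == 15:                # new state: up -> 13, down stays, left -> 12, right -> 14
        return [13, 15, 12, 14]
    row, col = map(int, np.argwhere(GRID == state)[0])
    nexts = []
    for drow, dcol in MOVES:
        if state == 13 and (drow, dcol) == (1, 0):
            nexts.append(15)       # modified dynamics: down from 13 now leads to 15
        elif 0 <= row + drow < 4 and 0 <= col + dcol < 4:
            nexts.append(int(GRID[row + drow, col + dcol]))
        else:
            nexts.append(state)    # moves off the grid leave the state unchanged
    return nexts

values = np.zeros(16)              # index 0 is the terminal state, fixed at 0
for _ in range(10_000):
    new = np.zeros(16)
    for state in range(1, 16):     # reward is -1 on every transition
        new[state] = np.mean([-1.0 + values[s] for s in successors(state)])
    delta = np.max(np.abs(new - values))
    values = new
    if delta < 1e-10:
        break

print(values[13], values[15])      # both converge to -20
```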

hhoppe commented 2 years ago

We start off with a linear system.

In part 1, we introduce a new state 15 (an additional row) without changing any other row, so we can simply solve for the value corresponding to the last row (state 15). The other values remain unchanged. Notably, v_13 = v_15 = -20.

In part 2, we modify the next-to-last row of the linear system from v_13 = -1 + 1/4 (v_12 + v_14 + v_9 + v_13) to v_13 = -1 + 1/4 (v_12 + v_14 + v_9 + v_15).

Let us solve the new linear system iteratively, starting from the values obtained in part 1 and using a Jacobi solver. In every row other than the next-to-last, the value does not change: v'_i = v_i, i \ne 13. The next-to-last row has changed, but the Jacobi update step yields exactly the same value as before: v'_13 = -1 + 1/4 (v_12 + v_14 + v_9 + v_15) = -1 + 1/4 (-22 - 14 - 20 - 20) = -20. Because the Jacobi update has not modified the vector v, v is the solution to the new linear system.
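
A quick numerical check of that update step (plugging in the Figure 4.1 values):

```python
# Jacobi update for state 13 with the Figure 4.1 values substituted in.
v_9, v_12, v_14, v_15 = -20.0, -22.0, -14.0, -20.0

v_13_new = -1 + 0.25 * (v_12 + v_14 + v_9 + v_15)
print(v_13_new)  # -20.0, identical to the old v_13, so v is a fixed point
```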

(In your derivation, I was not able to follow the line v_\pi(15) = ... = 22.)

vojtamolda commented 2 years ago

Thanks for the explanation!

I think I've realized what I did wrong. I made an algebraic mistake when solving the 2x2 linear system for v'_13 and v'_15. Had I solved it correctly, i.e. gotten v'_15 = v'_13 = -20, it would have been clear that my "perturbation theory"-style approximation wasn't needed, since the value of state 13 doesn't change and therefore the rest of the system has an identical solution too, even under the new circumstances with state 15.
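
For the record, here is the 2x2 system solved correctly, keeping the Figure 4.1 values v_9 = -20, v_12 = -22, v_14 = -14 fixed for the neighbors as in my approximation:

```latex
\begin{aligned}
v'_{13} &= -1 + \tfrac{1}{4}\bigl(v_{12} + v_{14} + v_{9} + v'_{15}\bigr)
         = -1 + \tfrac{1}{4}\bigl(-22 - 14 - 20 + v'_{15}\bigr), \\
v'_{15} &= -1 + \tfrac{1}{4}\bigl(v_{12} + v'_{13} + v_{14} + v'_{15}\bigr)
         = -1 + \tfrac{1}{4}\bigl(-22 + v'_{13} - 14 + v'_{15}\bigr),
\end{aligned}
```

which rearranges to 4 v'_13 - v'_15 = -60 and 3 v'_15 - v'_13 = -40, giving v'_13 = v'_15 = -20.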

vojtamolda commented 2 years ago

Just to be sure, I solved the whole system again using the value iteration algorithm. The code is in the attached Jupyter notebook. As a correctness check, I verified that it recovers the solution from the book when run on Example 4.1. The notebook has to run in the chapter04 directory in order to import code from the mdp.py file.

hhoppe commented 2 years ago

Wonderful; thanks for taking the time to do this! I really appreciate looking through all your answers.

vojtamolda commented 2 years ago

Thanks for the kind words. It's fun. Everyone needs a hobby...