vojtamolda / reinforcement-learning-an-introduction

Solutions to exercises in Reinforcement Learning: An Introduction (2nd Edition).

[Exercise 2.2] Question about solution (possible error) #13

Closed ShawnHymel closed 2 years ago

ShawnHymel commented 2 years ago

Thank you for posting these solutions! I have a question about your solution for exercise 2.2. After the reward at t=4 is incorporated, the estimates are Q = [-1, +1/3, 0, 0]. As a result, action 2 should be the greedy choice (estimate +1/3). However, the agent chose action 3 going into t=5, which is not greedy. Does that mean both A4 and A5 should be considered definite ε cases (rather than just A4)?

[Screenshot attached: "Screen Shot 2022-05-18 at 12.56.44 PM"]
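
For concreteness, here is how the sample-average estimates work out, assuming the step sequence of Exercise 2.2 is A1=1, R1=−1; A2=2, R2=+1; A3=2, R3=−2; A4=2, R4=+2; A5=3, R5=0 (that sequence is my recollection of the book, not quoted from this thread, but it reproduces the numbers above):

```latex
% Sample-average estimates after the reward at t = 4 (assumed sequence above).
\begin{aligned}
Q_4(1) &= \tfrac{-1}{1} = -1, &
Q_4(2) &= \tfrac{+1 - 2 + 2}{3} = +\tfrac{1}{3}, &
Q_4(3) &= Q_4(4) = 0 \;\text{(never selected)},
\end{aligned}
\quad\Rightarrow\quad
Q_4 = \bigl[-1,\ +\tfrac{1}{3},\ 0,\ 0\bigr].
```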
vojtamolda commented 2 years ago

I'm not sure if I understand your question. I don't think A5 exists in this exercise.

The way my notation works is that A1 uses the values from Q0 (the initial estimates, all zeros), and the received reward updates the estimate to Q1. So A5 would take you from Q5 to Q6, but this transition is never mentioned in the text of the exercise.
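
For reference, under that indexing the bookkeeping is just the standard sample-average update from the book, restated here as a sketch (not copied from the repo's solution): A_t is selected using Q_{t-1}, and the observed R_t produces Q_t.

```latex
% Sample-average update in the thread's indexing: A_t is selected from Q_{t-1},
% and the reward R_t updates only the chosen arm, giving Q_t.
Q_t(A_t) = Q_{t-1}(A_t) + \frac{1}{N_t(A_t)} \bigl( R_t - Q_{t-1}(A_t) \bigr),
\qquad
Q_t(a) = Q_{t-1}(a) \quad \text{for } a \neq A_t ,
```

where N_t(A_t) counts how many times arm A_t has been selected up to and including step t.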

ShawnHymel commented 2 years ago

Hi @vojtamolda,

Thank you for the quick response! I didn't quite follow the labeling, so let me see if I can describe it better. I'm still new to reinforcement learning, and I'm still learning the terminology.

From Q3 to Q4 (which I think you refer to as A4), the agent selects action 2, whose sample-average estimate at that point is -1/2. Because it did not select action 3 or 4 (each with an estimate of 0), we can say that A4 was an ε case (exploratory).

From Q4 to Q5 (A5), the estimates before going into t=5 are Q4 = [-1, +1/3, 0, 0]. The agent selects action 3, which has an estimate of 0. Why did it not select action 2, which had the highest estimate? Is that because this choice was also an ε case? Or is there some other reason it chose action 3?
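
To make the step-by-step bookkeeping concrete, here is a minimal sketch (not code from this repo) that replays the action/reward sequence as I recall it from Exercise 2.2 and checks, before each step, whether the chosen arm was in the greedy set; any step where it was not must have been an ε (exploratory) step:

```python
# Replay of the assumed Exercise 2.2 sequence (A = [1, 2, 2, 2, 3], R = [-1, 1, -2, 2, 0])
# with sample-average estimates, to see on which steps the chosen arm was non-greedy.
k = 4                              # number of arms
actions = [1, 2, 2, 2, 3]          # A_1..A_5 (1-indexed arms), assumed from the book
rewards = [-1, 1, -2, 2, 0]        # R_1..R_5, assumed from the book

q = [0.0] * k                      # current sample-average estimates, all zeros initially
n = [0] * k                        # selection counts per arm

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = [i + 1 for i, v in enumerate(q) if v == max(q)]   # arms tied for the max estimate
    definitely_eps = a not in greedy
    print(f"t={t}: estimates before step = {q}, chose arm {a}, "
          f"greedy set = {greedy}, definitely exploratory = {definitely_eps}")
    i = a - 1
    n[i] += 1
    q[i] += (r - q[i]) / n[i]      # incremental sample-average update
```

With that assumed sequence, the replay flags steps 4 and 5 as the ones where the chosen arm lies outside the greedy set.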

vojtamolda commented 2 years ago

Sorry @ShawnHymel, I don't really have the bandwidth to explain the content of the book here. The material is pretty dense, so I usually ended up re-reading each chapter a few times until it all clicked.

I think the solution to this exercise is correct. In fact, I recently fixed it in #9. But if there's still an error, feel free to reopen the issue and get back to me.

Thanks!

gilzamir18 commented 2 years ago

I think it is still wrong, because step 5 is indeed an exploration step. Note that A1 is based on the Q1 table and A5 is based on the Q5 table, where Q1 = [0, 0, 0, 0].