vojtamolda / reinforcement-learning-an-introduction

Solutions to exercises in Reinforcement Learning: An Introduction (2nd Edition).

[Exercise 2.2] Question about solution (possible error) #13

Closed ShawnHymel closed 2 years ago

ShawnHymel commented 2 years ago

Thank you for posting these solutions! I have a question about your solution for exercise 2.2. After the reward at t=4 is incorporated, the estimates are Q = [-1, +1/3, 0, 0]. As a result, action 2 should be the greedy choice (estimate +1/3). However, the agent chose action 3 going into t=5, which is not greedy. Does that mean both A4 and A5 should be considered definite ε cases (rather than just A4)?

[Screenshot attached: "Screen Shot 2022-05-18 at 12.56.44 PM"]
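
For concreteness, here is how the sample-average estimates work out, assuming the step sequence of Exercise 2.2 is A1=1, R1=−1; A2=2, R2=+1; A3=2, R3=−2; A4=2, R4=+2; A5=3, R5=0 (that sequence is my recollection of the book, not quoted from this thread, but it reproduces the numbers above):

```latex
% Sample-average estimates after the reward at t = 4 (assumed sequence above).
\begin{aligned}
Q_4(1) &= \tfrac{-1}{1} = -1, &
Q_4(2) &= \tfrac{+1 - 2 + 2}{3} = +\tfrac{1}{3}, &
Q_4(3) &= Q_4(4) = 0 \;\text{(never selected)},
\end{aligned}
\quad\Rightarrow\quad
Q_4 = \bigl[-1,\ +\tfrac{1}{3},\ 0,\ 0\bigr].
```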
vojtamolda commented 2 years ago

I'm not sure if I understand your question. I don't think A5 exists in this exercise.

The way my notation works is that A1 uses the values from Q0 (the initial estimates, all zeros), and the received reward updates the estimate to Q1. So A5 would take you from Q5 to Q6, but this transition is never mentioned in the text of the exercise.
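
For reference, under that indexing the bookkeeping is just the standard sample-average update from the book, restated here as a sketch (not copied from the repo's solution): A_t is selected using Q_{t-1}, and the observed R_t produces Q_t.

```latex
% Sample-average update in the thread's indexing: A_t is selected from Q_{t-1},
% and the reward R_t updates only the chosen arm, giving Q_t.
Q_t(A_t) = Q_{t-1}(A_t) + \frac{1}{N_t(A_t)} \bigl( R_t - Q_{t-1}(A_t) \bigr),
\qquad
Q_t(a) = Q_{t-1}(a) \quad \text{for } a \neq A_t ,
```

where N_t(A_t) counts how many times arm A_t has been selected up to and including step t.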

ShawnHymel commented 2 years ago

Hi @vojtamolda,

Thank you for the quick response! I didn't quite follow the labeling, so let me see if I can describe it better. I'm still new to reinforcement learning, and I'm still learning the terminology.

From Q3 to Q4 (which I think you refer to as A4), the agent selects action 2, whose sample-average estimate at that point is -1/2. Because it did not select action 3 or 4 (each with an estimate of 0), we can say that A4 was an ε case (exploratory).

From Q4 to Q5 (A5), the estimates before going into t=5 are Q4 = [-1, +1/3, 0, 0]. The agent selects action 3, which has an estimate of 0. Why did it not select action 2, which had the highest estimate? Is that because this choice was also an ε case? Or is there some other reason it chose action 3?
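
To make the step-by-step bookkeeping concrete, here is a minimal sketch (not code from this repo) that replays the action/reward sequence as I recall it from Exercise 2.2 and checks, before each step, whether the chosen arm was in the greedy set; any step where it was not must have been an ε (exploratory) step:

```python
# Replay of the assumed Exercise 2.2 sequence (A = [1, 2, 2, 2, 3], R = [-1, 1, -2, 2, 0])
# with sample-average estimates, to see on which steps the chosen arm was non-greedy.
k = 4                              # number of arms
actions = [1, 2, 2, 2, 3]          # A_1..A_5 (1-indexed arms), assumed from the book
rewards = [-1, 1, -2, 2, 0]        # R_1..R_5, assumed from the book

q = [0.0] * k                      # current sample-average estimates, all zeros initially
n = [0] * k                        # selection counts per arm

for t, (a, r) in enumerate(zip(actions, rewards), start=1):
    greedy = [i + 1 for i, v in enumerate(q) if v == max(q)]   # arms tied for the max estimate
    definitely_eps = a not in greedy
    print(f"t={t}: estimates before step = {q}, chose arm {a}, "
          f"greedy set = {greedy}, definitely exploratory = {definitely_eps}")
    i = a - 1
    n[i] += 1
    q[i] += (r - q[i]) / n[i]      # incremental sample-average update
```

With that assumed sequence, the replay flags steps 4 and 5 as the ones where the chosen arm lies outside the greedy set.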

vojtamolda commented 2 years ago

Sorry @ShawnHymel, I don't really have the bandwidth to explain the content of the book here. The material is pretty dense, so I usually ended up re-reading each chapter a few times until it all clicked.

I think the solution to this exercise is correct. In fact, I recently fixed it in #9. But if there's still an error, feel free to reopen the issue and get back to me.

Thanks!

gilzamir18 commented 2 years ago

I think it is still wrong, because step 5 is indeed an exploration step. Note that A1 is based on the Q1 table and A5 is based on the Q5 table, where Q1 = [0, 0, 0, 0].