vojtamolda / reinforcement-learning-an-introduction

Solutions to exercises in Reinforcement Learning: An Introduction (2nd Edition).

Exercise 5.6: No discount appearing in equation #1

Closed · Jonathan2021 closed this issue 3 years ago

Jonathan2021 commented 3 years ago

In your equation for Q(s,a) using weighted importance sampling, you separated R_{t+1} from what happens in subsequent timesteps but didn't discount G_{t+1} by gamma.
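
For reference, this is just the return recursion from the book, which is what makes the missing factor visible once the first reward is split off:

```latex
% Return recursion (Sutton & Barto): splitting off the first reward
% leaves a tail return that must be discounted by gamma.
G_t \doteq R_{t+1} + \gamma G_{t+1}
```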

A) Also, can't we just keep the original V(s) expression, but with t in T(s, a) (instead of t in T(s)), with the importance sampling ratio running from t+1 to T-1 in both the numerator and the denominator (at timestep t the probability of selecting a is 1 because it has already happened), and keeping G_t? See the sketch after point B.

B) It is true that if t = T-1 then the ratio from t+1 to T-1 is ill-defined, but that was already the case in the original expression.
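
Written out in the book's notation, what I have in mind for point A would be roughly the following (my transcription, so the exact form may differ slightly):

```latex
% Proposed Q(s,a) estimate with weighted importance sampling:
% sum over first-visit times t at which s was visited and action a taken,
% with the ratio starting at t+1 (the step that takes a contributes probability 1).
Q(s, a) \doteq
  \frac{\sum_{t \in \mathcal{T}(s, a)} \rho_{t+1:T(t)-1} \, G_t}
       {\sum_{t \in \mathcal{T}(s, a)} \rho_{t+1:T(t)-1}}
```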

Thanks for reading!

PS: Your corrections are by far the best, most complete and correct ones I have found. Keep up the good work ;)

vojtamolda commented 3 years ago

Thanks for reporting the mistake and for the kind words, @Jonathan2021. I'm still reading the book, now in chapter 12, so more is coming 😃

I'm on vacation now, but I'll set a reminder, take a look at it, and fix it once I'm back.

vojtamolda commented 3 years ago

Thanks for the correction. You're right. The discount factor is indeed missing from the equation. I'm just submitting a patch to fix it.

Now regarding your additional points:

A) I think the summation over t in T(s, a) wouldn't be correct. The definition of Q(s, a) only conditions the first action to be a. Future visits to s don't have this conditioning, but the summation over T(s, a) would incorrectly enforce it for all subsequent visits. It would effectively ignore contributions from taking any action other than a from s.

There's another way to see that it isn't correct. The entire formula can be rewritten using the original equation (5.6) for V(s) like this:

[IMG_0562: the formula rewritten in terms of the V(s) estimate from (5.6)]

Here s_t is the state right after s. Summation over T(s, a) gives a similar but not identical value and wouldn't allow this transformation.

B) The text of the book mentions how to handle this corner case, but it's easy to miss: the importance ratio is taken to be 0 whenever its denominator is 0. I adopted that assumption but didn't mention it explicitly.

PS: Thanks for the warm comment. I'm actually very surprised that people are reading my handwritten scribbles :)

Jonathan2021 commented 3 years ago

Hey @vojtamolda, sorry, I didn't have much time to dig into this until now.

By replacing T(s) with T(s, a) in the original expression (first-visit times where s was followed by action a) and adding new notation for the importance sampling ratio (to avoid the corner cases), I get the following:

[Image 241430179_544679293508327_1163890976622465113_n: the proposed Q(s, a) expression]

Edit: There is an error in the definition of the new ratio above. It should be ρ(a)_{t:t} = 1 and not ρ(a)_{t:t+1} = 1.
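
In case the photo is hard to read, the expression is roughly this (a rough transcription; the exact notation in the photo may differ):

```latex
% Rough transcription of the proposed expression: the same weighted importance
% sampling form, but summed over first visits where a was taken from s.
% The ratio skips the step at time t, so rho^{(a)}_{t:t} = 1 by convention
% (an empty product), which covers the t = T - 1 corner case.
Q(s, a) \doteq
  \frac{\sum_{t \in \mathcal{T}(s, a)} \rho^{(a)}_{t:T(t)-1} \, G_t}
       {\sum_{t \in \mathcal{T}(s, a)} \rho^{(a)}_{t:T(t)-1}},
\qquad
\rho^{(a)}_{t:h} \doteq \prod_{k = t+1}^{h} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)},
\qquad
\rho^{(a)}_{t:t} \doteq 1
```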

It's true that R_{t+1} is scaled by stuff that comes afterwards, but this is the case for every individual reward in G. The paragraph on discounting-aware importance sampling deals with this.
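
To spell that out with the book's ordinary importance sampling ratio:

```latex
% With the full ratio, every reward inside G_t (including R_{t+1}) is scaled
% by the action probabilities of all steps up to T - 1, even steps that occur
% after that reward has already been received.
\rho_{t:T-1} \, G_t
  = \rho_{t:T-1} \left( R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{\,T-t-1} R_T \right),
\qquad
\rho_{t:T-1} \doteq \prod_{k = t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```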

I hope all my blabbering makes some sense.