Exercise 5.6: No discount appearing in equation

Jonathan2021 commented 3 years ago

In your equation for Q(s,a) using Weighted Importance Sampling, you separated Rt from what happens in subsequent timesteps but didn't discount Gt+1 by lambda.

A) Also, can't we just keep the original V(s) expression with t in T(s, a) (instead of t in T(s)), the importance sampling ratio starting from t+1 to T-1 on both sides of the division (for timestep t probability of selecting a is 1 because it already happened) and keeping Gt.

B) It is true that if t = T-1 then the ratio from t+1 to T-1 is ill defined but that was already the case in the original expression.

Thanks for reading !

PS: Your corrections are by far the best, most complete and correct ones I have found. Keep up the good work ;)

vojtamolda commented 3 years ago

Thanks for reporting the mistake and for the kind words, @Jonathan2021. I'm still reading the book, now in chapter 12, so more is coming 😃

I'm on vacation now but I'll make a reminder, take a look at it and fix it once I'm back.

vojtamolda commented 3 years ago

Thanks for the correction. You're right. The discount factor is indeed missing from the equation. I'm just submitting a patch to fix it.

Now regarding your additional points:

A) I think the summation t in T(s, a) wouldn't be correct. The definition of Q(s, a) only conditions the first action to be a. The future visits to s don't have this conditioning but the summation over T(s, a) would incorrectly enforce it for all subsequent visits. It would effectively ignore contributions from taking all the other actions different than a from s.

There's another way to see it isn't correct. The entire formula can be rewritten with the use of the original equation for V(s) (5.6) like this:

.

Where s_t is the state right after s. Summation over T(s, a) is a slightly different but not identical value and wouldn't allow this transformation.

B) Text of the book mentions handling of this corner case but it's easy to miss. The importance ratio is equal to 0 if the denominator is 0. I adopted the assumption but didn't explicitly mention it.

PS: Thanks for warm the comment. I'm actually very surprised that people are reading my hand written scribbles :)

Jonathan2021 commented 3 years ago

Hey @vojtamolda, Didn't have much time to dig into this before now sorry.

I don't really get your point about enforcing all subsequent visits to s to take action a (I probably just didn't understand what you were trying to say). If T(s,a) is the first-visit to s followed by action a, then another visit to s in the episode followed by any action doesn't change anything to the expression. I think your expression in the correction is incorrect. T(s) being, by definition, all timesteps where s was first visited, the t in the red section would be the first time s was visited in an episode no matter the following action. t could be before the first visit to (s, a) if s was visited before without taking action a. This will sum returns and ratios we don't want to estimate Q(s,a). And even if you replace s by s' (s' being the state that occured after taking action a in s in the first visit), s' could have occured before too as it is not necessarily visited only by taking action a prior.
Another reason why your expression is incorrect is the following (even with my proposed modifications above). Suppose you only have episodes that end after the first occurence of (s, a) (meaning we only get one reward). In this case, Gt+1 = 0 and Gt=Rt+1 all the time. Now your expression can be reduced to sum of all Rt+1, which is incorrect because it isn't averaged in any way.

By replacing T(s) by T(s,a) in the original expression (first-visit times where s was followed by action a) and adding new notation to the importance sampling ratio (to avoid corner cases), I get the following. 241430179_544679293508327_1163890976622465113_n Edit: Error when defining the new ratio expression above. It is ρ(a)t:t = 1 and not ρ(a)t:t+1 = 1.

It's true that Rt+1 is scalled by stuff that comes afterwards but this is the case for every individual reward in G. The paragraph on Dicounting-aware importance sampling deals with this.

I hope all by blabbering makes some sense.

vojtamolda / reinforcement-learning-an-introduction

Exercise 5.6: No discount appearing in equation #1