vojtamolda / reinforcement-learning-an-introduction

Solutions to exercises in Reinforcement Learning: An Introduction (2nd Edition).
325 stars 71 forks

[Exercise 2.3] probability of picking the optimal action for epsilon = 0 #19

Closed minimental closed 1 year ago

minimental commented 1 year ago

Dear Vojta,

thank you for publishing your solutions! Great move. Most likely it is due to my lack of understanding of statistics, but in your solution to Exercise 2.3 you say that the probability of picking the best action is

P(A_t = a*) = 1 - (1 - 1/n) * epsilon

But if epsilon = 0 (pure greedy), that formula says the probability of picking the best action is 1. Since the greedy algorithm's choice depends on its randomly formed early estimates (and, never exploring, it cannot switch to a better action once it settles), I would assume the probability of picking the truly optimal action to be considerably less than 1.
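For the epsilon > 0 case, the formula is easy to check empirically. Below is a quick sanity-check sketch (not from the repo): with ties ignored, an epsilon-greedy agent picks its current greedy action either by exploiting (probability 1 - epsilon) or by happening to draw it during uniform exploration (probability epsilon/n), which sums to 1 - (1 - 1/n) * epsilon.

```python
import random

def eps_greedy_pick(q_est, eps):
    """One epsilon-greedy selection over estimated action values q_est."""
    if random.random() < eps:
        return random.randrange(len(q_est))          # explore: uniform over all actions
    return max(range(len(q_est)), key=q_est.__getitem__)  # exploit: argmax of estimates

random.seed(0)
n, eps, trials = 10, 0.1, 200_000
q_est = [0.0] * n
q_est[3] = 1.0  # suppose action 3 currently has the highest estimate

hits = sum(eps_greedy_pick(q_est, eps) == 3 for _ in range(trials))
print(hits / trials)        # empirical frequency of the greedy action
print(1 - (1 - 1/n) * eps)  # formula: 0.91
```

Note that this is the probability of choosing the action the agent *currently estimates* to be best, which is exactly where the epsilon = 0 disagreement below comes from.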

vojtamolda commented 1 year ago

Thanks for the kind words. I really appreciate it.

It's been a while since I worked these out, but from what I remember, the greedy strategy never does any exploration beyond what the agent currently believes is the best action. So it never gets a chance to discover that a better action exists. In that sense, the "best" action is relative to the limited knowledge of the "world" that the greedy agent has.
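This lock-in effect is easy to demonstrate with a small simulation (my own sketch, not from the repo): a pure-greedy sample-average agent on a 10-armed Gaussian testbed, as in Chapter 2 of the book, often settles on an arm that is not the truly optimal one.

```python
import random

def run_greedy(n=10, steps=1000, rng=random):
    """Pure greedy (epsilon = 0) sample-average agent on a Gaussian bandit.
    Returns True if its final greedy choice is the truly optimal arm."""
    q_true = [rng.gauss(0, 1) for _ in range(n)]   # hidden true action values
    q_est = [0.0] * n                              # initial estimates
    counts = [0] * n
    for _ in range(steps):
        a = max(range(n), key=q_est.__getitem__)   # always exploit
        counts[a] += 1
        reward = rng.gauss(q_true[a], 1)           # noisy reward
        q_est[a] += (reward - q_est[a]) / counts[a]  # incremental sample average
    best = max(range(n), key=q_true.__getitem__)
    return max(range(n), key=q_est.__getitem__) == best

random.seed(1)
runs = 2000
share = sum(run_greedy() for _ in range(runs)) / runs
print(share)
```

In my runs the share of agents that end up on the truly best arm is well below 1, which matches minimental's intuition: relative to the agent's own estimates the greedy pick is "best" with probability 1, but relative to the true values it is not.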

minimental commented 1 year ago

Hi Vojta,

thank you for the quick reply, and please excuse the delay. It took me a while to understand your reasoning and your formula for epsilon > 0, and I came to the same conclusion. The likelihood of choosing the best action for epsilon > 0 is

1 - (1-1/n) * epsilon,

where n is the number of actions, and epsilon is the probability of picking an action uniformly at random.

My only issue is that the above formula does not hold for epsilon = 0. If you rely solely on greedy choices, the likelihood of choosing the truly best action is actually pretty involved. See this article for an upper bound.