
RL course 1 week 1 blog #21

Closed sezan92 closed 1 year ago

sezan92 commented 1 year ago

Objective

This issue is to track and work on RL course 1 blog REF: #20

Tasks

sezan92 commented 1 year ago

Update 2022/11/01

Story

I have been an RL enthusiast for a long time. The concept of training an agent based only on rewards and punishments fascinated me from the beginning, so I kept exploring RL methods and architectures on my own for quite some time. For example, the blogs of The Game is ON! were the result of my learning RL.

But there was something missing. I had issues with basic terminology in the Reinforcement Learning literature, and some basic algorithms seemed very difficult to me. So I had to step back. I tried to read the famous book by Sutton and Barto, but something was still missing.

Then I got to know about the Reinforcement Learning Specialization by the University of Alberta.

Contd

sezan92 commented 1 year ago

Update 2023/02/15

TODO

sezan92 commented 1 year ago

Update 2023/02/16

Week 1

TODO Week 1

sezan92 commented 1 year ago

Update 2023/02/22

The big idea: suppose a doctor has 3 patients, and he does not know which medicine works. So he trials 3 medicines on the 3 patients. If he sees an improvement in a patient's health with a certain medicine, he prescribes that medicine.

k_armed_bandit_trial_picture_equation
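A minimal sketch of this k-armed bandit setting in Python. The success probabilities below are made up (they match the optimum values used later in this thread), and in the real problem the doctor does not know them:

```python
import random

# Made-up success probabilities for the three medicines (unknown to the doctor).
TRUE_SUCCESS_PROB = {"A": 0.25, "B": 0.75, "C": 0.5}


def prescribe(medicine):
    """Simulate one trial: reward 1 if the patient improves, 0 otherwise."""
    return 1 if random.random() < TRUE_SUCCESS_PROB[medicine] else 0


# The doctor tries each medicine once.
for medicine in TRUE_SUCCESS_PROB:
    print(medicine, prescribe(medicine))
```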

My note:

TODO

sezan92 commented 1 year ago

Update 2023/02/25

sezan92 commented 1 year ago

Update 2023/02/27

TODO

sezan92 commented 1 year ago

Update 2023/03/06

Action value

Action value is the value of an action. (I know, it is not genius to figure that out.) But the question is: how can we know the value of an action?

From the video: how can we know the value of the action of prescribing one medicine? One of the ways is the sample-average method.

Screenshot from 2023-03-06 13-19-15

intuition

For example, suppose some people had a headache and the doctor does not know which of the medicines $A$, $B$, $C$ is the best. He tries all of them. Suppose with medicine $A$ the headache is cured 90 out of 100 times, and with medicine $B$ it is 50 out of 100 times. Then the action value for $A$ is $90/100 = 0.9$. Now what about the other 10? They might have other factors, which leads us to the notion of $state$ value. More on that later. [Video minute 1:57]
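As a rough illustration (mine, not from the course), the sample-average estimate is simply the total reward collected for an action divided by the number of times that action was tried:

```python
def sample_average(rewards):
    """Sample-average action value: the mean of the rewards observed for an action."""
    return sum(rewards) / len(rewards) if rewards else 0.0


# Medicine A cured 90 out of 100 patients -> Q(A) = 0.9
rewards_A = [1] * 90 + [0] * 10
print(sample_average(rewards_A))  # 0.9
```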

TODO

sezan92 commented 1 year ago

Update 2023/03/08

Greedy action

Suppose we get action values for all three actions after many trials. When we choose the action with the best action value, this is called greedy action selection. This process of choosing the greedy action is known as exploitation.

exploitation

On the other hand, we can also explore other actions at the expense of getting the best reward. This lets us know more about the actions. This is known, as one might have guessed, as exploration.

Now the problem is, we cannot do both at the same time, at least with one model. This is a fundamental problem in Reinforcement Learning, known as the exploration-exploitation dilemma.
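A tiny sketch of greedy selection (exploitation), assuming we keep hypothetical action-value estimates in a dictionary:

```python
# Hypothetical action-value estimates after many trials.
Q = {"A": 0.9, "B": 0.5, "C": 0.6}

# Greedy action selection: exploit by picking the action with the highest estimate.
greedy_action = max(Q, key=Q.get)
print(greedy_action)  # "A"
```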

TODO

sezan92 commented 1 year ago

update 2023/03/10

TODO

sezan92 commented 1 year ago

Update 2023/03/14

Incremental update to action value

We can rewrite the sample-average method for learning action values as follows,

Screenshot from 2023-03-14 13-14-42

In other words,

Screenshot from 2023-03-14 13-15-45

In the equation, $\frac{1}{n}$ works when the number of steps is known and limited. But what if we do not know how many steps will be taken? For example, we never know in how many moves a game of chess will be won. In those cases, we can write $\alpha$ instead! It is a hyperparameter known as the step size. It dictates how quickly our agent updates the action value.

Screenshot from 2023-03-14 13-16-25

[up to 2:43]
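A one-function sketch of the incremental update $Q_{n+1} = Q_n + \alpha (R_n - Q_n)$; the step size and rewards below are made up:

```python
def update_action_value(q, reward, alpha=0.1):
    """Incremental update: move the estimate toward the new reward by step size alpha."""
    return q + alpha * (reward - q)


q = 0.0
for reward in [1, 0, 1, 1]:  # some made-up rewards
    q = update_action_value(q, reward)
print(q)
```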

TODO

sezan92 commented 1 year ago

Update 2023/03/16

TODO

sezan92 commented 1 year ago

Update 2023/03/27

Non-stationary bandit problem

Let's look at the equation again,

Screenshot from 2023-03-27 14-56-30

Screenshot from 2023-03-27 15-01-49

From the above two equations, we see that $Q_{n+1}$ depends more on the most recent rewards than on the past rewards, making it possible to keep the estimate updated over time.

What does the rewrite of the equation tell us?

It says that the action values always give more weight to the recent rewards than to the previous ones. Why is this important? It helps the model stay up to date. Some actions might be time dependent, e.g. some medicines might work in one season but not in others, and vice versa. This equation helps us keep track of recent rewards and lets the model learn from the most recent experiences.
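A small sketch (my own, with made-up numbers) of why a constant step size helps in the non-stationary case: the medicine's success rate drops halfway through, the constant-$\alpha$ estimate follows it, while the plain sample average lags behind:

```python
import random

random.seed(0)
alpha = 0.1
q_const, q_avg, n = 0.0, 0.0, 0

for t in range(1, 2001):
    # Non-stationary reward: the medicine works 80% of the time in the first
    # "season" and only 20% of the time afterwards.
    p = 0.8 if t <= 1000 else 0.2
    reward = 1 if random.random() < p else 0

    # Constant step size: recent rewards weigh more (exponential recency weighting).
    q_const += alpha * (reward - q_const)

    # Sample average: every reward weighs the same, so the estimate adapts slowly.
    n += 1
    q_avg += (reward - q_avg) / n

print(f"constant alpha: {q_const:.2f}, sample average: {q_avg:.2f}")
```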

TODO

sezan92 commented 1 year ago

2023/04/04

TODO

sezan92 commented 1 year ago

Update 2023/04/08

Exploration Exploitation trade-off

Video1

What is the trade-off

Exploration means the agent tries each action and sees what the action leads to.

Exploitation, on the other hand, means that the agent takes the action with the maximum reward according to its prior knowledge, which is always limited.

Now, if we let the agent always explore, it will never act according to its previous knowledge of the best actions, and so it will never maximize the total reward. If we always let it exploit, we might miss information about other plausible state-action pairs! This is known as the exploration-exploitation dilemma.

Solution?

One of the most popular solutions is epsilon-greedy action selection. For some of the time let the agent explore, for the rest let it exploit! Then how do we know when to do which? We set a threshold named $\epsilon$ and generate a random floating point number (0.0-1.0). If the random number is less than $\epsilon$, explore; otherwise exploit! Naturally, if we want to explore more, we set $\epsilon$ higher, otherwise lower. During training we generally want our agent to explore more at the beginning and exploit more at the end, so in the beginning $\epsilon$ is higher and in the end it is lower! There are other methods, but this is good as a getting-started solution.
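A minimal epsilon-greedy sketch, assuming the (hypothetical) action-value estimates are kept in a dictionary:

```python
import random


def epsilon_greedy(Q, epsilon):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(list(Q))  # explore: uniformly random action
    return max(Q, key=Q.get)           # exploit: current best estimate


Q = {"A": 0.9, "B": 0.5, "C": 0.6}     # hypothetical estimates
print(epsilon_greedy(Q, epsilon=0.1))
```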

TODO

sezan92 commented 1 year ago

Update 2023/04/17

TODO

sezan92 commented 1 year ago

Update 2023/04/19

Optimistic initial values

TODO

sezan92 commented 1 year ago

Update 2023/04/19

Optimistic initial values

For example

Suppose for the medical trial of 3 medicines, the optimum values are $q_1 = 0.25$, $q_2 = 0.75$, $q_3 = 0.5$.

But initially we do not know the optimum values. How about we start with high initial values,

$Q_1 = Q_2 = Q_3 = 2$

Now let's recall the incremental update equation,

$Q_{n+1} \leftarrow Q_n + \alpha(R_n - Q_n)$

Let's assume positive feedback gives a reward of $1$ and negative feedback gives $0$. After running some trials, we might get closer to the optimum values,

Screenshot from 2023-04-25 14-08-25
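A rough sketch of this trial (my own, with a made-up step size and greedy selection only), using the same optimum values as above:

```python
import random

random.seed(42)
true_prob = {"A": 0.25, "B": 0.75, "C": 0.5}  # the unknown optimum values
Q = {a: 2.0 for a in true_prob}               # optimistic initial estimates
alpha = 0.1

for _ in range(500):
    # Pure greedy selection: the optimism itself drives the early exploration,
    # because every barely-tried action still looks better than it really is.
    action = max(Q, key=Q.get)
    reward = 1 if random.random() < true_prob[action] else 0
    Q[action] += alpha * (reward - Q[action])

print({a: round(q, 2) for a, q in Q.items()})
```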

TODO

sezan92 commented 1 year ago

Update 2023/05/01

Comparison

compare_performance

From the image above, it seems that the optimistic initial value setting helps us more than epsilon-greedy!

Demerits

My opinion

TODO

sezan92 commented 1 year ago

Update 2023/05/09

TODO

sezan92 commented 1 year ago

Update 2023/05/17

TODO

sezan92 commented 1 year ago

Update 2023/05/22

TODO

sezan92 commented 1 year ago

Update 2023/06/05

Upper Confidence Bound (UCB) action selection

Epsilon-greedy action selection works in the following way,

Screenshot from 2023-06-05 12-22-11

Here, for exploration, we are choosing random actions uniformly. The problem is, we are giving the same weight to all actions when exploring. How about, while choosing the exploratory action, we prefer the less explored actions? In other words, we prioritize the actions with more uncertainty (the uncertainty comes from those actions being explored less).

TODO

sezan92 commented 1 year ago

Update 2023/06/14

For example, the uncertainty can be shown as follows,

Screenshot from 2023-06-14 17-16-38

In UCB, we optimistically guess that a less explored (more uncertain) action is good, i.e. that it has a high Q value, hence the name upper confidence bound.

TODO

sezan92 commented 1 year ago

update 2023/07/03

The equation for UCB combines exploration and exploitation like the following,

Screenshot from 2023-07-03 15-29-56

$t$ is the timestep and $N_t(a)$ is the number of times the action $a$ has been taken so far. It means that the more we explore an action $a$, the smaller its exploration bonus becomes, like in the following example,

Screenshot from 2023-07-03 15-33-09

where $c$ is a hyperparameter controlling the amount of exploration. On the 10-armed bandit testbed, the performance is as follows,

Screenshot from 2023-07-03 15-35-20
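A small sketch of UCB action selection based on the equation above; the bandit probabilities and $c$ are made up, and untried actions are treated as having an infinite bonus so each action is tried at least once:

```python
import math
import random

random.seed(0)
true_prob = {"A": 0.25, "B": 0.75, "C": 0.5}  # hypothetical bandit
Q = {a: 0.0 for a in true_prob}               # value estimates
N = {a: 0 for a in true_prob}                 # times each action was taken
c = 2.0                                       # exploration hyperparameter

for t in range(1, 1001):
    # UCB: value estimate plus an exploration bonus that shrinks as N_t(a) grows.
    ucb = {a: Q[a] + c * math.sqrt(math.log(t) / N[a]) if N[a] > 0 else float("inf")
           for a in Q}
    action = max(ucb, key=ucb.get)
    reward = 1 if random.random() < true_prob[action] else 0
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # sample-average update

print({a: (round(Q[a], 2), N[a]) for a in Q})
```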

sezan92 commented 1 year ago

TODO 2023/07/03

sezan92 commented 1 year ago

Update 2023/07/06

sezan92 commented 1 year ago

Update 2023/07/10

TODO

sezan92 commented 1 year ago

Update 2023/07/18

TODO

sezan92 commented 1 year ago

Update 2023/07/30

sezan92 commented 1 year ago

Update 2023/08/02

sezan92 commented 1 year ago

Update 2023/08/07

sezan92 commented 1 year ago

Update 2023/08/09

sezan92 commented 1 year ago

Done in #29