Closed — sezan92 closed this issue 1 year ago
I have been an RL enthusiast for a long time. The concept of training an agent based on rewards and punishments alone fascinated me from the beginning, so I kept exploring RL methods and architectures on my own for quite some time. For example, the blog series The Game is ON! was a result of my learning RL.
But there was something missing. I had trouble with basic terminology in the reinforcement learning literature, and some basic algorithms seemed very difficult to me. So I had to step back. I tried to read the famous book by Sutton and Barto, but something was still missing.
Then I got to know about the Reinforcement Learning Specialization by the University of Alberta.
Contd
The big idea: suppose a doctor has 3 patients and doesn't know which medicine works. So he trials 3 medicines on the 3 patients. If he sees a patient's health improve with a certain medicine, he prescribes that medicine.
Then the instructor defines the expected reward of an action as its action value. Here the action is prescribing a medicine, and the value is the improvement in health.
Assuming that better health means better blood pressure, the instructor gives intuition with the following illustration.
Here each action $a$ (i.e., a medicine) has an action value $q(a)$, i.e., the expected blood pressure.
Our real-life examples
My note:
learning action values
Action value is the value of an action (I know it is not genius to figure that out). But the question is: how can we know the value of an action?
From the video: how can we know the value of the action of prescribing one medicine? One of the ways is the sample-average method.
For example, suppose some people have a headache and the doctor does not know which of the medicines $A$, $B$, $C$ is best. He tries all of them. Suppose with medicine $A$ the headache is cured 90 out of 100 times, and with medicine $B$ it is 50 out of 100 times. Then the action value for $A$ is $90/100 = 0.9$. Now what about the other 10? They might involve other factors, which leads us to the notion of $state$ value; more on that later. [Video minute 1:57]
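The sample-average idea above can be sketched in a few lines of Python (the function name and the reward lists are my own illustrative choices, not from the course):

```python
# Sketch of the sample-average method: estimate an action's value
# as the mean of the rewards observed after taking that action.

def sample_average(rewards):
    """Average of observed rewards; 0.0 if the action was never tried."""
    return sum(rewards) / len(rewards) if rewards else 0.0

# Medicine A cured the headache 90 out of 100 times (reward 1 = cured, 0 = not).
rewards_a = [1] * 90 + [0] * 10
rewards_b = [1] * 50 + [0] * 50

print(sample_average(rewards_a))  # 0.9
print(sample_average(rewards_b))  # 0.5
```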
Suppose we get action values for all three actions after many trials. Choosing the action with the best action value is called greedy action selection, and this process of choosing the greedy action is known as exploitation.
On the other hand, we can also explore other actions at the expense of the best immediate reward. This lets us learn more about the actions, and is known, as one might have guessed, as exploration.
Now the problem is that we cannot do both at once, at least with a single policy. This is a fundamental problem in reinforcement learning, known as the exploration-exploitation dilemma.
We can write the sample-average method for learning action values as follows,

$$Q_{n+1} = \frac{R_1 + R_2 + \cdots + R_n}{n}$$

in other words, in incremental form,

$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$$

In the equation, $\frac{1}{n}$ works when the number of steps is known and limited. But what if we do not know how many steps will be taken? For example, we never know in how many moves a game of chess will be won. In those cases, we can write a constant $\alpha$ instead! It is a hyperparameter known as the step size, and it dictates how quickly our agent updates the action value.
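The constant-step-size update can be sketched as follows (the reward stream and $\alpha = 0.1$ are made-up values for illustration):

```python
# Minimal sketch of the incremental update Q_{n+1} = Q_n + alpha * (R_n - Q_n)
# with a constant step size alpha (a hyperparameter).

def update(q, reward, alpha=0.1):
    """One incremental action-value update with constant step size."""
    return q + alpha * (reward - q)

q = 0.0
for r in [1, 1, 0, 1]:   # a stream of rewards for one action
    q = update(q, r)
print(round(q, 4))       # → 0.2539
```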
[up to 2:43]
Let's look at the equation again; expanding the constant-step-size update gives

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$

From the above two equations, we see that $Q_{n+1}$ depends on the most recent rewards more than on the past rewards, making it possible to keep updating over time.
What does this rewrite of the equation tell us? It says that the action values always put more weight on recent rewards than on earlier ones. Why is this important? It helps the model stay up to date. Some actions might be time dependent; e.g., some medicines might work in one season but not in others, and vice versa. This equation helps us keep track of recent rewards and lets the model learn from the most recent experiences.
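As a sanity check, a short Python sketch (with made-up rewards and $\alpha = 0.5$) can confirm that the incremental update is exactly this exponentially recency-weighted average:

```python
# Verify that the constant-step-size update equals the closed-form
# exponentially weighted average, with weight alpha*(1-alpha)^(n-i) on R_i.

alpha = 0.5
rewards = [1.0, 0.0, 1.0, 1.0]

# Incremental form, starting from Q_1 = 0.
q = 0.0
for r in rewards:
    q = q + alpha * (r - q)

# Closed form: (1-alpha)^n * Q_1 + sum_i alpha*(1-alpha)^(n-i) * R_i
# (the Q_1 term vanishes here since Q_1 = 0).
n = len(rewards)
closed = sum(alpha * (1 - alpha) ** (n - i) * r
             for i, r in enumerate(rewards, start=1))

print(q, closed)  # both 0.8125: the newest reward carries the largest weight
```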
Video1
What is the trade-off?
Exploration means the agent tries each action and sees what the action leads to.
Exploitation, on the other hand, means that the agent takes the action with the maximum expected reward according to its prior knowledge, which is always limited.
Now, if we let the agent always explore, it will never act according to its knowledge of the best actions, and so it will never maximize the total reward. If we always let it exploit, we might miss information about other plausible state-action pairs! This is known as the exploration-exploitation dilemma.
Solution?
One of the most popular solutions is epsilon-greedy action selection. For some of the time let the agent explore, and for the rest let it exploit! Then how do we know when to do which? We set a threshold named $\epsilon$ and generate a random floating-point number in $[0.0, 1.0)$. If the random number is less than $\epsilon$, explore; otherwise exploit! Naturally, if we want to explore more, we set $\epsilon$ higher, otherwise lower. During training we generally want our agent to explore more initially and exploit more at the end, so $\epsilon$ starts high and is decreased over time! There are other methods, but this is good as a getting-started solution!
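A minimal epsilon-greedy sketch in Python (the q values and the decay schedule are illustrative assumptions, not from the course):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))              # explore: uniform random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: greedy action

q = [0.25, 0.75, 0.5]
epsilon = 1.0                            # start fully exploratory
for _ in range(1000):
    action = epsilon_greedy(q, epsilon)
    epsilon = max(0.01, epsilon * 0.995) # decay toward mostly exploiting
```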
Suppose for the medical trial of 3 medicines, the optimal action values are $q_1 = 0.25$, $q_2 = 0.75$, $q_3 = 0.5$.
But initially we do not know the optimal values. How about we start with high initial estimates,
$Q_1 = Q_2 = Q_3 = 2$
Now let's recall the incremental update equation,

$$Q_{n+1} \leftarrow Q_n + \alpha(R_n - Q_n)$$
Let's assume positive feedback gives reward $1$ and negative gives $0$. After running some trials, we should get closer to the optimal values.
From the image above, it seems that the optimistic-initial-value setting helps us more, at least early on, compared to epsilon-greedy!
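The optimistic-initial-value idea can be sketched like this (the Bernoulli reward simulation, seed, and step size are my illustrative assumptions):

```python
import random

# Optimistic initial values for the 3-medicine example: start every estimate
# at 2, well above the true values, and always act greedily. The optimism
# itself drives early exploration: whichever arm is tried gets pulled down,
# so the greedy choice rotates until the estimates become realistic.

random.seed(0)
true_q = [0.25, 0.75, 0.5]
q = [2.0, 2.0, 2.0]          # optimistic initial values
alpha = 0.1

for _ in range(2000):
    a = max(range(3), key=q.__getitem__)           # pure greedy selection
    r = 1 if random.random() < true_q[a] else 0    # Bernoulli reward
    q[a] += alpha * (r - q[a])                     # incremental update

print([round(v, 2) for v in q])  # all estimates fall below the optimistic start
```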
Epsilon-greedy action selection works in the following way.
Here, for exploration, we choose random actions uniformly. The problem is that we give the same weight to all random actions. How about, while choosing the random action, we favor the less-explored actions? In other words, we prioritize the actions with greater uncertainty (due to those actions being less explored).
For example, the uncertainty can be shown as follows,
In UCB, we optimistically guess that an uncertain action is good, i.e., has a high $Q$ value; hence the name upper confidence bound on the action value.
The equation for UCB combines exploration and exploitation like the following,

$$A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

Here $t$ is the timestep and $N_t(a)$ is the number of times action $a$ has been taken. It means that the more we explore an action $a$, the smaller its uncertainty bonus becomes, as in the following example,
where $c$ is a hyperparameter controlling the degree of exploration. On the 10-armed bandit test bed, the performance is as follows,
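A UCB selection step can be sketched like this (the q values, counts, and the convention of treating untried actions as maximally uncertain are my illustrative assumptions):

```python
import math

# Sketch of UCB action selection: A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].
# Untried actions (N(a) == 0) get an infinite bonus, so they are picked first.

def ucb_action(q_values, counts, t, c=2.0):
    best, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_values, counts)):
        score = float("inf") if n == 0 else q + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = a, score
    return best

q = [0.25, 0.75, 0.5]
counts = [10, 1, 5]
print(ucb_action(q, counts, t=16))  # → 1: the rarely tried arm gets a large bonus
```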
Done in #29
Objective
This issue is to track and work on RL course 1 blog REF: #20
Tasks