Closed — sezan92 closed this issue 1 year ago
I have been an RL enthusiast for a long time. The concept of training an agent based on rewards and punishments alone fascinated me from the beginning, so I kept exploring RL methods and architectures on my own for quite some time. For example, the blog series The Game is ON! was a result of my learning RL.
But there was something missing. I had trouble with basic terminology in the reinforcement learning literature, and some basic algorithms seemed very difficult to me. So I had to step back. I tried to read the famous book by Sutton and Barto, but something was still missing.
Then I got to know about the Reinforcement Learning Specialization by the University of Alberta.
Contd
The big idea: suppose a doctor has 3 patients and doesn't know which medicine works. So he trials 3 medicines on the 3 patients. If he sees a patient's health improve with a certain medicine, he prescribes that medicine.
Then the instructor defines the expected reward of an action as its action value. Here the action is prescribing a medicine, and the value is the improvement in health.
Assuming that better health means better blood pressure, the instructor gives intuition with the following illustration.
Here each action $a$ (i.e., a medicine) has an action value $q(a)$, i.e., the expected blood pressure.
Our real-life examples
My note:
learning action values
Action value is the value of an action (I know it is not genius to figure that out). But the question is: how can we know the value of an action?
From the video: how can we know the value of the action of prescribing one medicine? One of the ways is the sample-average method.
For example, suppose some people have a headache and the doctor does not know which of the medicines $A$, $B$, $C$ is best. He tries all of them. Suppose with medicine $A$ the headache is cured 90 out of 100 times, and with medicine $B$ it is 50 out of 100 times. Then the action value for $A$ is $90/100 = 0.9$. Now what about the other 10? They might involve other factors, which leads us to the notion of $state$ value; more on that later. [Video minute 1:57]
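The sample-average idea above can be sketched in a few lines of Python (the function name and the reward lists are my own illustrative choices, not from the course):

```python
# Sketch of the sample-average method: estimate an action's value
# as the mean of the rewards observed after taking that action.

def sample_average(rewards):
    """Average of observed rewards; 0.0 if the action was never tried."""
    return sum(rewards) / len(rewards) if rewards else 0.0

# Medicine A cured the headache 90 out of 100 times (reward 1 = cured, 0 = not).
rewards_a = [1] * 90 + [0] * 10
rewards_b = [1] * 50 + [0] * 50

print(sample_average(rewards_a))  # 0.9
print(sample_average(rewards_b))  # 0.5
```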
Suppose we get action values for all three actions after many trials. Choosing the action with the best action value is called greedy action selection, and this process of choosing the greedy action is known as exploitation.
On the other hand, we can also explore other actions at the expense of the best immediate reward. This lets us learn more about the actions, and is known, as one might have guessed, as exploration.
Now the problem is that we cannot do both at once, at least with a single policy. This is a fundamental problem in reinforcement learning, known as the exploration-exploitation dilemma.
We can write the sample-average method for learning action values as follows,

$$Q_{n+1} = \frac{R_1 + R_2 + \cdots + R_n}{n}$$

in other words, in incremental form,

$$Q_{n+1} = Q_n + \frac{1}{n}\left(R_n - Q_n\right)$$

In the equation, $\frac{1}{n}$ works when the number of steps is known and limited. But what if we do not know how many steps will be taken? For example, we never know in how many moves a game of chess will be won. In those cases, we can write a constant $\alpha$ instead! It is a hyperparameter known as the step size, and it dictates how quickly our agent updates the action value.
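The constant-step-size update can be sketched as follows (the reward stream and $\alpha = 0.1$ are made-up values for illustration):

```python
# Minimal sketch of the incremental update Q_{n+1} = Q_n + alpha * (R_n - Q_n)
# with a constant step size alpha (a hyperparameter).

def update(q, reward, alpha=0.1):
    """One incremental action-value update with constant step size."""
    return q + alpha * (reward - q)

q = 0.0
for r in [1, 1, 0, 1]:   # a stream of rewards for one action
    q = update(q, r)
print(round(q, 4))       # → 0.2539
```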
[up to 2:43]
Let's look at the equation again; expanding the constant-step-size update gives

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha (1-\alpha)^{n-i} R_i$$

From the above two equations, we see that $Q_{n+1}$ depends on the most recent rewards more than on the past rewards, making it possible to keep updating over time.
What does this rewrite of the equation tell us? It says that the action values always put more weight on recent rewards than on earlier ones. Why is this important? It helps the model stay up to date. Some actions might be time dependent; e.g., some medicines might work in one season but not in others, and vice versa. This equation helps us keep track of recent rewards and lets the model learn from the most recent experiences.
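As a sanity check, a short Python sketch (with made-up rewards and $\alpha = 0.5$) can confirm that the incremental update is exactly this exponentially recency-weighted average:

```python
# Verify that the constant-step-size update equals the closed-form
# exponentially weighted average, with weight alpha*(1-alpha)^(n-i) on R_i.

alpha = 0.5
rewards = [1.0, 0.0, 1.0, 1.0]

# Incremental form, starting from Q_1 = 0.
q = 0.0
for r in rewards:
    q = q + alpha * (r - q)

# Closed form: (1-alpha)^n * Q_1 + sum_i alpha*(1-alpha)^(n-i) * R_i
# (the Q_1 term vanishes here since Q_1 = 0).
n = len(rewards)
closed = sum(alpha * (1 - alpha) ** (n - i) * r
             for i, r in enumerate(rewards, start=1))

print(q, closed)  # both 0.8125: the newest reward carries the largest weight
```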
Video1
What is the trade-off?
Exploration means the agent tries each action and sees what the action leads to.
Exploitation, on the other hand, means that the agent takes the action with the maximum expected reward according to its prior knowledge, which is always limited.
Now, if we let the agent always explore, it will never act according to its knowledge of the best actions, and so it will never maximize the total reward. If we always let it exploit, we might miss information about other plausible state-action pairs! This is known as the exploration-exploitation dilemma.
Solution?
One of the most popular solutions is epsilon-greedy action selection. For some of the time let the agent explore, and for the rest let it exploit! Then how do we know when to do which? We set a threshold named $\epsilon$ and generate a random floating-point number in $[0.0, 1.0)$. If the random number is less than $\epsilon$, explore; otherwise exploit! Naturally, if we want to explore more, we set $\epsilon$ higher, otherwise lower. During training we generally want our agent to explore more initially and exploit more at the end, so $\epsilon$ starts high and is decreased over time! There are other methods, but this is good as a getting-started solution!
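A minimal epsilon-greedy sketch in Python (the q values and the decay schedule are illustrative assumptions, not from the course):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))              # explore: uniform random action
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit: greedy action

q = [0.25, 0.75, 0.5]
epsilon = 1.0                            # start fully exploratory
for _ in range(1000):
    action = epsilon_greedy(q, epsilon)
    epsilon = max(0.01, epsilon * 0.995) # decay toward mostly exploiting
```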
Suppose for the medical trial of 3 medicines, the optimal action values are $q_1 = 0.25$, $q_2 = 0.75$, $q_3 = 0.5$.
But initially we do not know the optimal values. How about we start with high initial estimates,
$Q_1 = Q_2 = Q_3 = 2$
Now let's recall the incremental update equation,

$$Q_{n+1} \leftarrow Q_n + \alpha(R_n - Q_n)$$
Let's assume positive feedback gives reward $1$ and negative gives $0$. After running some trials, we should get closer to the optimal values.
From the image above, it seems that the optimistic-initial-value setting helps us more, at least early on, compared to epsilon-greedy!
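The optimistic-initial-value idea can be sketched like this (the Bernoulli reward simulation, seed, and step size are my illustrative assumptions):

```python
import random

# Optimistic initial values for the 3-medicine example: start every estimate
# at 2, well above the true values, and always act greedily. The optimism
# itself drives early exploration: whichever arm is tried gets pulled down,
# so the greedy choice rotates until the estimates become realistic.

random.seed(0)
true_q = [0.25, 0.75, 0.5]
q = [2.0, 2.0, 2.0]          # optimistic initial values
alpha = 0.1

for _ in range(2000):
    a = max(range(3), key=q.__getitem__)           # pure greedy selection
    r = 1 if random.random() < true_q[a] else 0    # Bernoulli reward
    q[a] += alpha * (r - q[a])                     # incremental update

print([round(v, 2) for v in q])  # all estimates fall below the optimistic start
```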
Epsilon-greedy action selection works in the following way.
Here, for exploration, we choose random actions uniformly. The problem is that we give the same weight to all random actions. How about, while choosing the random action, we favor the less-explored actions? In other words, we prioritize the actions with greater uncertainty (due to those actions being less explored).
For example, the uncertainty can be shown as follows,
In UCB, we optimistically guess that an uncertain action is good, i.e., has a high $Q$ value; hence the name upper confidence bound on the action value.
The equation for UCB combines exploration and exploitation like the following,

$$A_t = \underset{a}{\operatorname{argmax}} \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

Here $t$ is the timestep and $N_t(a)$ is the number of times action $a$ has been taken. It means that the more we explore an action $a$, the smaller its uncertainty bonus becomes, as in the following example,
where $c$ is a hyperparameter controlling the degree of exploration. On the 10-armed bandit test bed, the performance is as follows,
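A UCB selection step can be sketched like this (the q values, counts, and the convention of treating untried actions as maximally uncertain are my illustrative assumptions):

```python
import math

# Sketch of UCB action selection: A_t = argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].
# Untried actions (N(a) == 0) get an infinite bonus, so they are picked first.

def ucb_action(q_values, counts, t, c=2.0):
    best, best_score = 0, float("-inf")
    for a, (q, n) in enumerate(zip(q_values, counts)):
        score = float("inf") if n == 0 else q + c * math.sqrt(math.log(t) / n)
        if score > best_score:
            best, best_score = a, score
    return best

q = [0.25, 0.75, 0.5]
counts = [10, 1, 5]
print(ucb_action(q, counts, t=16))  # → 1: the rarely tried arm gets a large bonus
```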
Done in #29
Objective
This issue is to track and work on RL course 1 blog REF: #20
Tasks