gutfeeling / practical_rl_for_coders

Learn reinforcement learning with Python, Gym and Keras.

Lesson: Task, rewards and episodes #22

Closed gutfeeling closed 3 years ago

gutfeeling commented 4 years ago

At this point, we have an agent who can observe its environment and take some actions, like moving to the left. But now, we want the agent to learn a circus trick in this environment.

You can see the circus trick in this video. The trick is to balance the brown pole upright for a certain amount of time, which is 200 time steps. As we have seen before, the pole tends to naturally fall down under its own weight. But by moving in a certain way, almost like a circus performer, the agent is able to balance the pole upright.

That's the learning goal in the CartPole-v0 environment. Your job as a programmer will be to teach the agent how to reach the learning goal.

In Reinforcement Learning, this teaching happens in two steps. The first step is to construct a reward function.

I will give you an example of a reward function first and then formally define it, okay? So here's an example reward function in CartPole-v0.

Since the goal here is to keep the brown pole upright, we can start by defining what is considered upright. Let's say we decide that if the pole stays between -12 degrees and +12 degrees, then the pole is upright. If it tilts more than that, it is no longer upright and the goal has failed.

Now, as the agent moves, we give it a reward of +1 for every time step that the pole is between -12 degrees and +12 degrees. And for every time step that the pole tilts more than 12 degrees from the vertical, we give the agent a reward of 0.
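As a rough sketch (this is just the rule described above, not Gym's actual source code), the reward for one time step could be computed like this, where pole_angle_degrees is assumed to be the pole's angle from the vertical:

```python
def reward_for(pole_angle_degrees):
    """Reward rule described above: +1 while the pole counts as upright."""
    if -12 <= pole_angle_degrees <= 12:
        return 1  # pole is considered upright, so the agent gets its reward
    return 0      # pole has tilted too far, so no reward for this time step
```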

gutfeeling commented 4 years ago

In the previous lessons, we saw how the agent can take actions in an environment. In the CartPole-v0 environment, we can make the agent move to the left using env.step(0) and right using env.step(1).
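As a reminder, here is roughly what taking those actions looks like, assuming the classic Gym API used throughout this course:

```python
import gym

env = gym.make("CartPole-v0")
env.reset()     # start a fresh episode

env.step(0)     # push the cart to the left
env.step(1)     # push the cart to the right
```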

The next step is to give the agent a task in this environment. In CartPole-v0, the agent's task is to perform a circus trick. The trick is to balance the brown pole upright for a certain amount of time, which is 200 time steps.

This task is quite difficult because, as we have seen before, the pole tends to naturally fall down under its own weight. Remember that when the environment is initialized, the pole angle is set to nearly vertical but not exactly vertical. That little angle is enough for the pole to start falling. The agent has to prevent that at all times by moving in the correct way.

In Reinforcement Learning, we approach any task like a dog trainer. Imagine that you have to teach a dog to sit down when you say "sit". How would you do that? You take a bunch of dog treats, and every time the dog sits down when you say "sit", you reward it with a treat. If it doesn't sit, you can punish the dog, even though I do not recommend punishing animals. But it is a theoretical possibility. Since the dog likes rewards and hates punishment, after a bit of conditioning the dog always sits down when you say "sit".

In Reinforcement Learning, we teach the agent to do a task much like we would teach a dog. So first, we need to find an equivalent of dog biscuits for the agent. This is called the reward function.

The reward function determines how much reward is given to the agent in every time step. Positive values mean reward, 0 means no reward, and negative values mean punishment. The value depends on both the environment state and the action. So you can think of it as a trainer's judgment of how desirable an environment state is, or how desirable an action is in a given environment state. We will see an example right away, so that this becomes more clear.
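Before the concrete example, here is just the shape of a reward function as a schematic (not any particular environment's code):

```python
def reward_function(state, action):
    """Schematic only: the trainer's judgment for one time step.

    A positive return value is a reward, 0 is no reward, and a negative
    value is a punishment. Each environment fills in its own judgment,
    which may depend on the state, the action, or both.
    """
    return 0.0  # placeholder judgment
```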

The good news for us is that Gym already comes with a reward function for the task of balancing the pole. This is a great advantage of Gym. For every environment, we are not only given a task but also a reward function for that task. Gym acts like a trainer and hands out dog biscuits to the agent when the agent has earned them.

Remember that the env.step() function returns a 4-tuple. We only discussed the first element so far, which is the new environment state after the action. The second element is the reward given to the agent for this environment state and the action. Currently, the value of the reward is 1. So we can more aptly write this statement as observation, reward, _, _ = env.step(0).

Let's try moving right now and see what reward we get. Wow, we still get 1. So whether we move to the left or right doesn't seem to change whether the agent gets its biscuit: it gets the reward in both cases.
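In code, the comparison looks roughly like this, assuming the environment was just reset and the pole has not yet tilted past 12 degrees:

```python
observation, reward, _, _ = env.step(0)   # move left
print(reward)                             # 1.0

observation, reward, _, _ = env.step(1)   # move right
print(reward)                             # 1.0 again: here the reward doesn't depend on the action
```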

So what's the logic behind this reward function? Remember that we said that the reward function is theoretically a function of the environment state and the action. In the case of CartPole-v0, it is a judgment on the environment state only, and therefore a function only of the environment state, not of the action.

The judgment is as follows. We want the pole to be vertical. So we decide that any angle between -12 and +12 degrees can be considered vertical, and therefore good. If the pole tilts more than 12 degrees, it's bad.

Similarly, we want the cart to stay within the environment limits while balancing the pole, so we don't want the cart to crash into the boundaries of the environment. Staying within the bounds is good; crashing into a boundary is bad.

So as long as the cart is within the environment bounds and the pole is within 12 degrees, everything is good and the agent gets a reward of +1.

But if the pole angle is more than 12 degrees, then that's bad.
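Putting the two judgments together, the "good state" check amounts to something like the following sketch. The 12 degree limit is the one we just discussed; the 2.4 limit for the cart position is, as far as I know, the value Gym uses for the CartPole track, so treat the exact number as a detail of the environment:

```python
def state_is_good(cart_position, pole_angle_degrees):
    """Good means: the cart is inside the track limits and the pole is roughly vertical."""
    cart_within_bounds = -2.4 <= cart_position <= 2.4     # cart has not crashed into a boundary
    pole_upright = -12 <= pole_angle_degrees <= 12         # our definition of "upright"
    return cart_within_bounds and pole_upright
```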

I can show that to you. In the loop where the agent was always moving to the left, we can also print out the rewards and the pole angle. Since the pole angle is in radians, I will convert it to degrees. Since 360 degrees is 2*pi radians, we need to multiply the radian value by 360 / (2*pi) to get the angle in degrees. Then let's print the reward and the angle. We see that initially, the reward is +1 because the pole angle is within 12 degrees. As soon as the pole angle exceeds 12 degrees, the reward is 0, because that's bad for balancing the pole.
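Here is a sketch of that loop, assuming the environment was created as above. In CartPole-v0, the pole angle (in radians) is the third element of the observation:

```python
import math

observation = env.reset()
for t in range(30):
    observation, reward, _, _ = env.step(0)                     # always move to the left
    pole_angle_degrees = observation[2] * 360 / (2 * math.pi)   # convert radians to degrees
    print(reward, pole_angle_degrees)
```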

Another important aspect is the length of these training sessions. Theoretically, a training session can go on forever. The agent can keep taking actions and the trainer can keep providing rewards and punishments. But in the CartPole-v0 environment, the training sessions do not go on forever. They are finite and are called episodes.

In the CartPole-v0 environment, episodes end under two conditions. First, if the episode length exceeds 200 time steps, the episode is terminated. That's because the agent's task here is to balance the pole for 200 time steps. Additional time steps are not relevant.

The second condition under which episodes are terminated relates to the bad environment states that we saw earlier. Instead of letting the agent go into these bad environment states and then try to recover from them, we just kill the agent the very first time it enters a bad state. This is like "Game Over" in video games. Your main character died and you must start the game all over again. This sort of termination is actually equivalent to a harsher variant of the reward function we discussed before. That reward function allowed the agent to make mistakes, then recover and earn rewards again. But with episode termination, we rob the agent of any chance of getting more rewards after it makes a mistake.

The agent can check whether an episode is over by looking at the third element of the return value of env.step(). If it is False, the episode has not terminated. If it is True, it has terminated and you should call env.reset() to start over. Therefore, it is better to write this statement as observation, reward, done, _ = env.step(0), where done represents whether the episode has terminated.

We can also check that using our loop. Let's print the value of done as well. We find that when the pole angle exceeds 12 degrees and the agent enters a bad environment state for the first time, then done becomes True. The agent must stop training and start over again.
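A sketch of the extended loop, under the same assumptions as before:

```python
import math

observation = env.reset()
for t in range(30):
    observation, reward, done, _ = env.step(0)                  # always move to the left
    pole_angle_degrees = observation[2] * 360 / (2 * math.pi)   # same conversion as before
    print(reward, pole_angle_degrees, done)
    if done:                                                    # a bad state was reached
        print("Episode terminated, starting over")
        observation = env.reset()
```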

Now the crux of Reinforcement Learning is this: maximizing the reward during an episode is equivalent to solving the task. This is very important to understand.

We know that an episode lasts for a maximum of 200 time steps. We know that the maximum reward in each time step is 1. So the maximum total reward per episode that the agent can possibly get is 200.

Good. Now let's think backwards. If the agent got a total reward of 200 in an episode, this means that it got a reward of 1 in each of the 200 time steps. A reward of 1 implies that the environment state is good, which means that the pole angle stayed within 12 degrees for all 200 time steps and the cart did not go out of the environment bounds for all 200 time steps. That is just another way of saying that the agent balanced the pole for 200 time steps. This condition of maximum reward therefore looks like this video and is equivalent to solving the task.
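You can check this reasoning with a quick sketch: run one episode to the end, add up the rewards, and the total can never exceed 200. The random actions below are just a placeholder; a successful agent would have to choose its moves much more cleverly to actually reach 200:

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()

total_reward = 0
done = False
while not done:
    action = env.action_space.sample()              # placeholder: random left/right moves
    observation, reward, done, _ = env.step(action)
    total_reward += reward                          # +1 for every time step in a good state

print(total_reward)   # at most 200; exactly 200 means the pole was balanced for the whole episode
```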

I repeat this once more so that it sinks in. The reward function must be engineered such that maximizing it is equivalent to solving the task. We found that the reward function in CartPole-v0 is correctly engineered.

Given this, the task of an agent in a Reinforcement Learning problem setup becomes very simple. The agent simply needs to learn how to maximize rewards in an environment. This will be the focus of the remainder of the course - given a reward function, how do you maximize it?

But before we dive into that, I am going to spend one more lesson talking about reward functions used in problems from academia and industry, just so that you get a feel for reward functions. Most of the time, we will find that the reward functions are extremely simple to engineer. We will also find that the Reinforcement Learning agents that learn, like a dog, to maximize those reward functions acquire awe-inspiring capabilities.