gutfeeling / practical_rl_for_coders

Learn reinforcement learning with Python, Gym and Keras.

Lesson: Rewards and Episodes #18

Closed gutfeeling closed 4 years ago

gutfeeling commented 4 years ago

Focus

Learning goal and how this is defined in terms of rewards.

Content

We already saw what happens when the Agent, which is the cart, always moves to the left. The brown pole starts swinging and falling down. The learning goal of the CartPole-v0 environment is to teach the cart how to move so that the pole stays balanced upright.

I am going to show you a video of what that looks like. See how the cart moves gracefully so that the pole stays balanced upright. It's like an elegant circus performer. Our first major project in this course will be to go from this dumb Agent, which just moves to the left, to this graceful circus performer. That's the learning goal in this environment.

Now pay attention because I am going to say something important. In Reinforcement Learning, the Agent is actually never told about this learning goal that I just described. Instead, the environment gives the Agent a reward or a punishment after each action.

This is the second element of the tuple returned by the env.step() function. Typically, if an action is bad for the learning goal, the Agent gets a negative reward, also called a punishment. If an action is good for the learning goal, it gets a positive reward.

Here we see that the Agent gets a reward of +1 for moving to the left. To capture the reward, we can modify the line to read observation, reward, _, _ = env.step(0).
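For reference, here is a minimal sketch of that line in context, assuming the env object created in the previous lessons and the classic Gym API, where env.step() returns a 4-tuple (observation, reward, done, info):

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()

# Take the "move to the left" action (0) and unpack the reward.
# The done flag and the info dict are ignored with underscores for now.
observation, reward, _, _ = env.step(0)
print(reward)  # prints 1.0 in CartPole-v0
```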

The agent's job is to get the most reward possible during an episode.

What is an episode? CartPole-v0 is actually like a game where your job is to collect the maximum points before the game ends. The game ends when one of the following conditions is met, and you can see these conditions in the Wiki.

The first condition is that the pole angle is greater than 12 degrees from the vertical. This means that the cart has not managed to keep the pole vertical, and therefore the game ends.

The second condition is that the cart position is more than ±2.4 (the center of the cart reaches the edge of the display). This means the cart has crashed into the edge of the environment while trying to balance the pole, and therefore the game ends.

The third condition is that the cart has taken 200 actions and neither of the first two conditions has been met. This means that the cart has balanced the pole for 200 steps, and this time the game ends because the cart has won. That's the maximum time for which the cart is expected to balance the pole in this environment.

You can find out whether the game has ended by using the third element of the return value of env.step(). This returns True when an episode has ended. Otherwise, it returns False. We store it in a variable called done.

Let's play out an episode here by adjusting our code. We add the condition that we should break the loop when done is True. And you see, the episode ends due to the first condition: the pole angle exceeds 12 degrees from the vertical.

How are the rewards given out in this environment? It turns out that the reward is always +1, no matter what you do. We can print out the rewards and verify this. In this episode, the agent gets a total reward of 9. The job of the agent is to get as much reward as possible during an episode. The maximum is 200, because an episode can last for as long as 200 steps and each step gives a reward of +1.
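Putting the pieces together, a minimal sketch of one full episode with the always-move-left behavior might look like this (again assuming the classic 4-tuple Gym API; the exact total you see will vary from run to run):

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()

total_reward = 0.0
while True:
    # Always move to the left (action 0): our dumb Agent.
    observation, reward, done, _ = env.step(0)
    total_reward += reward  # the reward is always +1.0 in CartPole-v0
    if done:
        # The episode has ended, e.g. because the pole tilted past 12 degrees.
        break

print(total_reward)  # a small number like 9.0 for this behavior
```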

This is important. In Reinforcement Learning, the goal of the agent is to maximize its rewards. It does not know the high level learning goal, which is balancing the pole on the cart. But if it learns to maximize the rewards, then the high level learning goal is automatically reached. This is true for all Reinforcement Learning problems, not just this one.

This has huge advantages, because a learning goal of maximizing rewards is general enough to be applicable to many problems. In fact, let's take a look at some other problems we will solve in this course. Here's a problem called Lunar Lander. In this problem, the learning goal is to teach a lunar capsule to land gracefully in the landing zone. Here again, the agent, which is the lunar capsule, will get rewards after each action, which is firing its rockets, and it has to learn to maximize those rewards. If it can do that, it will be able to land gracefully automatically.
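To underline how general this is, the same stepping code works unchanged for Lunar Lander. Here is a sketch, assuming the LunarLander-v2 environment id and that the Box2D extra is installed (e.g. pip install gym[box2d]):

```python
import gym

# Same interface as CartPole-v0, just a different environment id.
env = gym.make("LunarLander-v2")
observation = env.reset()

# Fire a randomly sampled rocket action and receive a reward, exactly as before.
observation, reward, done, _ = env.step(env.action_space.sample())
print(reward, done)
```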

You can think of it this way. The agent takes an action. The environment immediately tells the agent if that action was good or bad. The agent learns over time to only take good actions. Good actions automatically lead the agent to its learning goal.

The second important thing is about the agent's observations. Even though we know that the first element represents the cart position, the second represents the cart velocity and so on, the agent itself has no idea about high level concepts like position or velocity. It just sees four numbers and does not know what those numbers mean. In Reinforcement Learning, we never try to give the agent any environment specific information, e.g. about the objects and the laws of physics that drive this environment. The agent is left to figure this out on its own. It sees the environment as four meaningless numbers and receives feedback about its actions from the environment in terms of rewards. Based on that, it has to find the meaning in those numbers and use it to maximize the rewards it gets in the environment.
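You can see this for yourself: from the agent's point of view, the observation is nothing more than an array of four floating point numbers. A quick sketch, assuming the same env object as before:

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()

# To us these are cart position, cart velocity, pole angle and pole angular
# velocity. To the agent, it is just an array of four numbers.
print(observation)        # e.g. [ 0.03 -0.01  0.02  0.04]
print(observation.shape)  # (4,)
```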

This approach is so general and environment agnostic that it is often possible to take an agent which has learned in one environment and drop it into another. Just like the agent figured out the first environment, it will figure out the second one, even though the dynamics of the two environments may be very different. That is one of the features that makes Reinforcement Learning so interesting, powerful and awe-inspiring.

Let's summarize what we learned in this lesson.

So now we know the learning goal in this environment, which is to maximize the rewards. We will focus on that as our first major project in this course. However, before we do that, I want to show you the awesome things that can be learned by this approach of maximizing rewards in an environment agnostic way.

Alternate Content

Reinforcement Learning problems are like video games. In the last lesson, we saw that the agent can take two actions in the CartPole-v0 environment: move to the left and move to the right. You can think of these actions as if the agent is pressing two buttons on a joystick.

We saw what happens in this game as you press buttons. If you keep pressing the "Go to the left" button, the agent moves to the left and the pole starts swinging.

Just like you have points in a game, there are points (also called rewards or punishments) in this problem. You get a reward or a punishment right after taking an action.

This is the second element of the tuple returned by the env.step() function. Here we see that the Agent gets a reward of +1 for moving to the left. To capture the reward, we can modify the line to read observation, reward, _, _ = env.step(0).

Just like in a game, you try to get the most points possible, in a Reinforcement Learning problem the goal is the same: take actions in such a way so as to get the maximum possible rewards before the game ends.

This brings us to the second point. Games mostly end when you do something wrong and die, or when you win. For example, in Super Mario, the game can end if you run into a monster or jump into the abyss. In CartPole-v0, the game ends under three conditions.

The first condition is that the pole angle is greater than 12 degrees from the vertical. This means that the cart has not managed to keep the pole vertical, and therefore the game ends.

The second condition is that the cart position is more than ±2.4 (the center of the cart reaches the edge of the display). This means the cart has crashed into the edge of the environment while trying to balance the pole, and therefore the game ends.

The third condition is that the cart has taken 200 actions and neither of the first two conditions has been met. This means that the cart has balanced the pole for 200 steps, and this time the game ends because the cart has won. That's the maximum time for which the cart is expected to balance the pole in this environment.

You can find out whether the game has ended by using the third element of the return value of env.step(). This returns True when an episode has ended. Otherwise, it returns False. We store it in a variable called done.

Let's play out an episode here by adjusting our code. We add the condition that we should break the loop when done is True. And you see, the episode ends due to the first condition: the pole angle exceeds 12 degrees from the vertical.

How are the rewards given out in this environment? It turns out that the reward is always +1, no matter what you do. We can print out the rewards and verify this. In this episode, the agent gets a total reward of 9.

The job of the agent is to get as much reward as possible during an episode. The maximum is 200, because an episode can last for as long as 200 steps.

I am going to show you a video of what that looks like. I have trained this agent using Reinforcement Learning to maximize the rewards in this environment. See how the cart moves gracefully so that the pole stays balanced upright. It's like an elegant circus performer.

This illustrates another important point. From the agent's point of view, it is simply maximizing the rewards it gets by taking the right actions. It has no concept of balancing. But as a result of maximizing rewards, the high level behavior of balancing emerges automatically. This is a phenomenon that we will see in many Reinforcement Learning problems. If the reward is well engineered, simply maximizing the rewards can lead to high level behavior like being able to walk, drive, balance, etc.

This has huge advantages, because very different high level learning goals like walking, driving, balancing etc. can all be formulated in terms of designing a reward mechanism that encourages the emergence of that high level behavior. From the agent's point of view, all these problems look the same - it simply has to maximize the rewards in the environment.

This is why an agent that has learned balancing in one environment can be dropped into another environment to learn driving. Since the problem looks the same from the agent's point of view, which is maximizing rewards, it can learn driving in exactly the same way, using the same code. This makes Reinforcement Learning a surprisingly general and portable machine learning method. Many people say Reinforcement Learning is the closest to human intelligence among the different forms of machine learning exactly for this reason. We humans can also learn different things using the same brain.

In order to make sure that every Reinforcement Learning problem seems like the same problem to the agent, we also try not to put any environment specific knowledge into the Reinforcement Learning algorithms we design to maximize rewards. For example, the algorithm that I wrote for CartPole-v0, which you see in action here, doesn't know what the numbers in the observation mean. Of course, I know that the first element is the position, the second is the velocity and so on. But I have not told the algorithm about that. To the algorithm, the environment state is just a bunch of four numbers. Therefore, if I drop the same agent into another environment, it won't complain that the last environment had velocity but this environment has acceleration, so it doesn't know what to do. Instead, it will say that the last environment had a bunch of numbers and this environment also has a bunch of numbers, so they are the same, and it can use the same trick for learning in this environment that it used in the last one. Don't worry if you don't understand this right away; it's a tricky concept. But it will become clear to you later on as we implement some of these learning algorithms.

Let's summarize what we learned in this lesson. We learned that Reinforcement Learning problems are like games. Points are given out for each action, and these are called rewards or punishments. They are returned by env.step() as the second element of the return value. We learned that the game ends when you fail, which are these two conditions, or when you win, which is this condition. You can figure out if the game has ended by looking at the third element of the return value of env.step(), which we call done. It is True if the game has ended, False otherwise. Your job is to get the maximum rewards possible before the episode ends, that is, before done becomes True. In CartPole-v0, the reward is always +1 for every action and the maximum reward you can get is 200. If you can find the correct action in every step of the game in order to get this maximum reward, the agent shows the awesome high level behavior of being able to balance the pole on the cart like a circus performer. We also discussed how all Reinforcement Learning problems are formulated in terms of rewards in an environment, leading to different high level behavior depending on the environment, action and reward structures.

We will focus on solving CartPole-v0 as the first project in this course, and we will get introduced to our first Reinforcement Learning algorithm in the process. However, before we do that, I want to spend a few videos showing you what other awesome high level behavior we can get using this very general machine learning method called Reinforcement Learning.

Links

  1. https://cdn.vox-cdn.com/thumbor/QRYYrckYBOBThV032C3LMER5Ptg=/1400x1400/filters:format(jpeg)/cdn.vox-cdn.com/uploads/chorus_asset/file/19587509/jbareham_200108_ply0989_ps4_controller_0124.jpg

  2. https://www.youtube.com/watch?v=ETKPMkAGqDc

  3. https://www.youtube.com/watch?v=V9jPoHQ9_m0

Another Alternate Content

Reinforcement Learning problems are like video games. In the last lesson, we saw that the agent can take two actions in the CartPole-v0 environment: move to the left and move to the right. You can think of these actions as if the agent is pressing two buttons on a joystick.

We saw what happens in this game as you press buttons. If you keep pressing the "Go to the left" button, the agent moves to the left and the pole starts swinging.

Just like you have points in a game, there are points (also called rewards or punishments) in this problem. Here is a gameplay video of Super Mario. Here are the points in the game. You see that the points increase when Mario kills monsters and collects stars. This encourages players to kill more monsters and collect more stars.

The second element returned by the env.step() function is the points in the CartPole-v0 environment. For example, in this case, the number returned is +1. It's not the total points, but the additional points that you get after taking an action. If it is positive, it's a reward. If it is negative, it's a punishment. Finally, it can also be zero.

In Super Mario, if a monster crashes into you, or if you jump into the lava, it's game over and you start again. Similarly, CartPole-v0 also has the concept of game over. The third element returned by the env.step() function indicates if the game is over. If it is False, the game continues and you can try to get more rewards. If it is True, then the game is over. We store it in a variable called done.

Just like Super Mario encourages players to kill more monsters and collect more stars by giving rewards and discourages crashing into monsters and falling into the lava by ending the game, the CartPole-v0 environment also does something similar. Let's play one round of the game and see for ourselves.

Print out the reward, done and the pole angle.
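A minimal sketch of that, assuming the classic 4-tuple Gym API and that the third element of the observation is the pole angle, which Gym reports in radians, so we convert it to degrees:

```python
import math

import gym

env = gym.make("CartPole-v0")
observation = env.reset()

while True:
    observation, reward, done, _ = env.step(0)  # keep moving to the left
    pole_angle = math.degrees(observation[2])   # pole angle from the vertical
    print(reward, done, pole_angle)
    if done:
        break
```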

We see that as long as the pole angle from the vertical stays small, the agent gets a reward of +1. The environment is basically encouraging the agent to keep the pole angle from the vertical small. As soon as the pole angle reaches 12 degrees, the game is over because done becomes True. So the environment discourages the pole from swinging more than 12 degrees from the vertical.

Similarly, there is one more condition for game over, which we find in the Gym Wiki. This condition means that the agent has crashed into the boundaries of the environment. The environment discourages that and wants the agent to stay within the limits of the box.

To summarize, the environment encourages the agent to balance the pole vertically while staying within the box, like we see in this video.

In Super Mario, if you do everything right for eight levels, you win the game and the game is over. Mario meets the princess and the ending text appears. This is the moment that players strive towards.

Similarly, in CartPole-v0, if you keep the pole balanced and stay within the boundaries of the environment for 200 steps, you win the game and the game is over. That looks like this. That's the third condition in the Wiki. In this condition, you get a reward of +1 for 200 steps, and get the maximum reward of 200 points in the episode.

Our dumb agent, which always moves to the left, gets 9 points. This kind of strategy - always move to the left no matter what environment state you are in - is called a policy. Another policy could be to move randomly no matter which environment state you are in. Let's try this policy out. This gets a score of 20, still far below the maximum possible.
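A sketch of the random policy, using Gym's built-in action sampler (the total reward will differ between runs, since the actions are random):

```python
import gym

env = gym.make("CartPole-v0")
observation = env.reset()

total_reward = 0.0
while True:
    # Random policy: pick action 0 or 1 uniformly, regardless of the observation.
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)
    total_reward += reward
    if done:
        break

print(total_reward)  # typically a bit better than always-left, e.g. around 20
```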

In Reinforcement Learning, we want to find the policy that maximizes the total rewards in an episode - that is the central goal. When you find that policy, this is how the agent will behave. As you can see, the act of maximizing the rewards automatically leads to the awesome skill of being able to balance the pole like an elegant circus performer. This happens because the rewards encourage this skill.

Let's summarize what we have learned. We learned that Reinforcement Learning problems are like games. Points are given out after each action, and these are called rewards or punishments. They are returned by env.step() as the second element of the return value. The rewards encourage a certain kind of behavior - in this case, keeping the pole vertical while not going out of the bounds of the environment. We learned that the game ends when you make fatal mistakes, which are these two conditions - in these cases, the agent dies and gets no more rewards. You can figure out if the agent has died by looking at the third element of the return value of env.step(), which we call done. It returns True if you died. The maximum reward you can get in this game is 200. That's because if you don't die for 200 steps, you win the game! This is another condition in which done becomes True, but this time that's a good thing. Our goal is to learn a policy of taking actions that maximizes the total reward in the game, and when we do that, the agent shows the awesome high level behavior of being able to balance the pole on the cart like a circus performer.

In the next lesson, we will discuss with some examples why this video game like learning paradigm is very powerful.