Learning goal and how this is defined in terms of rewards.


We already saw what happens when the Agent, which is the cart, always moves to the left. The brown pole starts swinging and falling down. The learning goal of the CartPole-v0 environment is to teach the cart how to move so that the pole stays balanced upright.

I am going to show you a video of what that looks like. See how the cart moves gracefully so that the pole stays balanced upright. It's like an elegant circus performer. Our first major project in this course will be to go from this dumb Agent, which just moves to the left, to this graceful circus performer. That's the learning goal in this environment.

Now pay attention because I am going to say something important. In Reinforcement Learning, the Agent is actually never told about this learning goal that I just described. Instead, the environment gives the Agent a reward or a punishment after each action.

This reward is the second element of the tuple returned by the env.step() function. Typically, if an action is bad for the learning goal, the Agent gets a negative reward, also called a punishment. If an action is good for the learning goal, it gets a positive reward.

To capture the reward, we can modify the line by writing observation, reward, _, _ = env.step(0). Here we see that the Agent gets a reward of +1 for moving to the left.

The agent's job is to get the most reward possible during an episode.

What is an episode? CartPole-v0 is actually like a game where your job is to collect the maximum points before the game ends. The game ends when one of the following conditions is met, and you can see these conditions in the Wiki.

The first condition is the pole angle is greater than 12 degrees from the vertical. This means that the cart has not managed to keep the pole vertical, and therefore, the game ends.

The second condition is that the cart position goes beyond ±2.4 (the center of the cart reaches the edge of the display). This means the cart has crashed at the edge of the environment while trying to balance the pole, and therefore, the game ends.

The third condition is that the cart has taken more than 200 actions and neither of the first two conditions has been met. This means that the cart has balanced the pole for 200 steps and this time the game ends because the cart has won. That's the maximum time for which the cart is expected to balance the pole in this environment.
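To make the three conditions concrete, here is a minimal sketch of them as a single check. The function name episode_over and its parameters are my own, not part of the environment, and I am assuming the 200th step is the last one, which matches the maximum total reward of 200. Note that it takes the pole angle in degrees, as described in this lesson.

```python
def episode_over(cart_position: float, pole_angle_deg: float, steps_taken: int) -> bool:
    """Return True if any of the three end-of-episode conditions holds.

    Hypothetical helper for illustration only; the real environment
    performs this check internally and reports it through `done`.
    """
    pole_fell = abs(pole_angle_deg) > 12.0   # condition 1: pole tilted too far
    cart_crashed = abs(cart_position) > 2.4  # condition 2: cart left the track
    time_is_up = steps_taken >= 200          # condition 3: step limit reached (assumed inclusive)
    return pole_fell or cart_crashed or time_is_up
```

For example, episode_over(0.0, 13.0, 5) is True because the pole has tilted past 12 degrees, while episode_over(0.0, 0.0, 199) is False because the cart is still balancing and has one more step to go.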

You can find out whether the game has ended by using the third element of the return value of env.step(). This returns True when an episode has ended. Otherwise, it returns False. We store it in a variable called done.

Let's play out an episode here by adjusting our code. We add the condition that we should break the loop when done is True. And you see, the episode ends due to the first condition: the pole angle exceeds 12 degrees from the vertical.

How are the rewards given out in this environment? It turns out that the reward is always +1, no matter what you do. We can print out the rewards and verify this. In this episode the agent gets a total reward of 9. The job of the agent is to get as much reward as possible during an episode. The most it can get is 200, because an episode can last for at most 200 steps.
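Here is a rough sketch of that episode loop. So that it runs without any installation, I am using a tiny stand-in class called ToyEnv, which is my own invention and not part of Gym: it gives +1 per step and ends the episode after a fixed number of steps. The function run_episode is also a hypothetical name. It assumes the older step() interface used in this lesson, which returns four values; newer Gym and Gymnasium releases return a different tuple, so the unpacking would need adapting there.

```python
class ToyEnv:
    """Hypothetical stand-in for an environment (not part of Gym).

    Gives a reward of +1 on every step and reports done=True
    once `fail_after` steps have been taken.
    """

    def __init__(self, fail_after=9):
        self.fail_after = fail_after
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0.0, 0.0, 0.0, 0.0]  # four placeholder numbers

    def step(self, action):
        self.steps += 1
        done = self.steps >= self.fail_after
        return [0.0, 0.0, 0.0, 0.0], 1.0, done, {}


def run_episode(env, action=0, max_steps=200):
    """Repeat one fixed action until the episode ends; return the total reward."""
    env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        observation, reward, done, info = env.step(action)
        total_reward += reward
        if done:  # stop as soon as the environment says the episode is over
            break
    return total_reward


print(run_episode(ToyEnv(fail_after=9)))
```

With the older Gym interface, you could pass the real CartPole-v0 environment to run_episode in place of ToyEnv. The total then simply counts how many steps the pole stayed up.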

This is important. In Reinforcement Learning, the goal of the agent is to maximize its rewards. It does not know the high-level learning goal, which is balancing the pole on the cart. But if it learns to maximize the rewards, then the high-level learning goal is automatically reached. This is true for all Reinforcement Learning problems, not just this one.

This has huge advantages, because a learning goal of maximizing rewards is general enough to be applicable in many problems. In fact, let's take a look at some other problems we will solve in this course. Here's a problem called Lunar Lander. In this problem, the learning goal is to teach a lunar capsule to land gracefully in the landing zone. Here again, the agent, which is the lunar capsule, will get rewards after each action, which is firing its rockets, and it has to learn to maximize those rewards. If it can do that, it will be able to land gracefully automatically.

You can think of it this way. The agent takes an action. The environment immediately tells the agent if that action is good or bad. The agent learns over time to only take good actions. Good actions automatically lead the agent to its learning goal.

The second important thing is about the agent's observations. Even though we know that the first element represents the cart position, the second represents cart velocity and so on, the agent itself has no idea about high-level concepts like position or velocity. It just sees four numbers and it does not know what those numbers mean. In Reinforcement Learning, we never try to give the agent any environment-specific information, e.g. about the objects and the laws of physics that drive this environment. The agent is left to figure this out on its own. It sees the environment as four meaningless numbers and receives feedback about its actions from the environment in terms of rewards. Based on that, it has to find the meaning in those numbers and use that to maximize the rewards that it gets in the environment.
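To illustrate what "four meaningless numbers" looks like from the agent's side, here is a small sketch of a decision rule that maps four unlabeled numbers to one of the two actions, 0 or 1. This is only an illustration and not the learning method we will use in the course: the function make_random_linear_policy and the idea of a random weighted sum are my own assumptions.

```python
import random
from typing import Callable, Sequence


def make_random_linear_policy(seed: int = 0) -> Callable[[Sequence[float]], int]:
    """Build a decision rule with four random weights (illustrative only).

    Nothing in the rule knows which number is a position or a velocity.
    """
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(4)]

    def policy(observation: Sequence[float]) -> int:
        # Weighted sum of four unlabeled numbers; its sign picks the action.
        score = sum(w * x for w, x in zip(weights, observation))
        return 1 if score > 0 else 0

    return policy


policy = make_random_linear_policy(seed=0)
action = policy([0.02, -0.01, 0.03, 0.04])  # four numbers in, action 0 or 1 out
```

A rule like this starts out knowing nothing. The only thing that can tell it whether its weights are any good is the reward it collects.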

This approach is so general and environment-agnostic that it is often possible to take an agent which has learned in one environment and drop it into another. Just like the agent has figured out the first environment, it will figure out the second one, even though the dynamics of the two environments may be very different. That is one of the features that makes Reinforcement Learning so interesting, powerful and awe-inspiring.

Let's summarize what we learned in this lesson.

So now we know the learning goal in this environment, which is to maximize the rewards. We will focus on that as our first major project in this course. However, before we do that, I want to show you the awesome things that can be learned by this approach of maximizing rewards in an environment-agnostic way.

