gutfeeling / practical_rl_for_coders

Learn reinforcement learning with Python, Gym and Keras.

Lesson: The Agent and its Environment #16

Closed: gutfeeling closed this issue 3 years ago

gutfeeling commented 4 years ago

Focus

The environment and the agent

What and why

In the last lesson, we saw the visual representation of the CartPole-v0 problem. In this lesson, we will try to understand this visual representation and, in the process, pick up some basic Reinforcement Learning concepts.

Content

  1. Every Reinforcement Learning problem has an Agent. In the CartPole-v0 problem, the black rectangle that you see in the middle is the Agent. It is called a cart in this problem.

  2. The Agent lives in an Environment, the world that surrounds it. In this case, the rectangular screen is the Environment. It contains the Agent (the cart), a black wire and a brown pole attached to the cart.

  3. The gym.make() function returns such an environment, which contains the agent and other objects. That's also why we stored the return value in a variable called env, which is short for environment.
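
     To make this concrete, here is a minimal sketch of how the environment is created, assuming the classic gym API used in this course:

     ```python
     import gym

     # gym.make() builds the CartPole-v0 environment; the returned object
     # contains the Agent (the cart) and the other objects (wire and pole)
     env = gym.make("CartPole-v0")
     ```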

  4. Just like you are watching the environment in this popup window, the Agent can also observe its Environment, even though it may not perceive it the way we humans do. When we initialize the environment using env.reset(), it returns this array of four floating point numbers. This array represents the Agent's observation of the system.
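
     For example, the following sketch prints the observation returned by env.reset(); the numbers in the comment are only illustrative, since initialization is random:

     ```python
     # env.reset() initializes the environment and returns the Agent's
     # first observation: an array of four floating point numbers
     print(env.reset())
     # e.g. [-0.0321  0.0471 -0.0253  0.0021]
     ```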

  5. What do these four numbers represent? We can find that out by looking at the Wiki on Gym's GitHub page. I will put a link to that page in the lecture notes.

  6. As we see on this page, the first element is the cart position, which can vary from -2.4 to +2.4. When we call env.reset(), it initializes the environment. The initial state places the cart at the center, but not exactly: there is a little randomness, and every initialization places the cart at a slightly different position very near the center. That's why the first element is 0.something, usually a very small value, but not exactly zero.
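
     We can see this randomness by resetting the environment a few times and printing the first element; the values in the comment are only illustrative:

     ```python
     # Every reset places the cart at a slightly different position
     # very near the center
     for _ in range(3):
         observation = env.reset()
         print(observation[0])
     # e.g. 0.0213, -0.0158, 0.0044: small, random, never exactly zero
     ```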

  7. The second element is the current cart velocity along the wire, which can range from -infinity to +infinity. This is also set to nearly zero, with a bit of randomness, at every initialization.

  8. The third element is the pole angle from the vertical, which can range between -41.8 degrees and +41.8 degrees, expressed in radians of course. The vertical position has the value 0. When the pole is tilted to the right, the pole angle is positive; when it is tilted to the left, the pole angle is negative. We find that the current value is 0.something, which is very close to vertical. The tiny deviation from the vertical position once again adds an element of randomness, and it makes the pole start falling under its own weight.
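
     Since the Wiki quotes the limits in degrees while the observation itself is in radians, a quick conversion shows what these bounds look like inside the observation array:

     ```python
     import math

     # The +/- 41.8 degree limit from the Wiki, expressed in radians
     print(math.radians(41.8))   # about 0.7296
     print(math.radians(-41.8))  # about -0.7296
     ```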

  9. The last element is the velocity of the tip of the pole. This can also range from -infinity to +infinity. The faster the pole falls, the higher the absolute value.

  10. Since env.reset() returns the observation of the Agent, we can assign the return value to a variable called observation and then print that out. This name is more descriptive and makes the code easier to read.
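
      In code, this looks as follows:

      ```python
      # Store the return value in a descriptively named variable
      observation = env.reset()
      print(observation)
      ```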

  11. What kind of values can the Agent's observation take? gym describes this through the environment's observation space. If we print env.observation_space, we find that it is a special data type called Box(4). This datatype is defined by gym. Here, Box tells us that it is a sequence of floating point numbers which can take values in a specific range, and Box(4) means that the sequence has 4 such numbers. Box(2), on the other hand, would mean a sequence of 2 floating point numbers, each taking values in a specific range. gym has many of these quirky data types, and we will discover more of them as we deal with more environments in the course.
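
      A minimal sketch, noting that the exact printed form of Box varies a little between gym versions:

      ```python
      # The observation space describes what the Agent's observations look like
      print(env.observation_space)       # Box(4,): 4 floats, each within a range

      # The per-element ranges are exposed as arrays
      print(env.observation_space.low)   # smallest allowed value of each element
      print(env.observation_space.high)  # largest allowed value of each element
      ```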

Summary

  1. Let's summarize what we learned. We learned that a Reinforcement Learning problem has an Agent, which lives in an Environment. This is what we see in the visual representation of the environment drawn by env.render(). The env.reset() function initializes the environment by placing the agent (or the cart) nearly at the center, setting the pole nearly vertical, and setting the cart's velocity along the wire and the pole tip's velocity to nearly zero. The function returns the Agent's first observation of the environment in the initial state. The observations belong to a special gym data type called Box(4), which is a sequence of 4 floating point numbers, each of which can take values in a specific range. The meaning of the numbers and their ranges are documented in the Wiki on Gym's GitHub page.
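
     Here is the whole lesson as one short sketch, again assuming the classic gym API used in this course:

     ```python
     import gym

     env = gym.make("CartPole-v0")  # the Environment, which contains the Agent

     observation = env.reset()      # initialize and get the first observation
     env.render()                   # draw the visual representation

     print(observation)             # four floats: cart position, cart velocity,
                                    # pole angle, pole tip velocity
     print(env.observation_space)   # Box(4,)
     ```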

  2. In the next lesson, we will talk about the dynamics in this Reinforcement Learning problem.