Exercise: In the MountainCar-v0 environment, print out the observation space. How many elements are there in the observation? Look up the GitHub Wiki and find out what the elements mean.
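A minimal sketch of this exercise, assuming the classic gym API where env.reset() returns an observation:

    import gym

    env = gym.make("MountainCar-v0")
    print(env.observation_space)   # a Box space; its shape tells you how many elements an observation has
    print(env.reset())             # one sample observation from the environment
    env.close()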
Exercise: In the MountainCar-v0 environment, print out the action space. How many actions are possible in the environment? For each action, write a loop where that action is taken repeatedly for 30 steps, visualize what happens and try to guess what the action means. Confirm your guess by looking up the environment details in the GitHub Wiki.
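A sketch of the repeat-one-action loop, again assuming the classic gym API with env.render() for visualization:

    import gym

    env = gym.make("MountainCar-v0")
    print(env.action_space)                 # a Discrete space; .n gives the number of possible actions
    for action in range(env.action_space.n):
        env.reset()
        for _ in range(30):
            env.render()                    # watch what repeating this action does to the car
            env.step(action)
    env.close()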
Lesson: Rewards and Episodes
In CartPole-v0, we want the agent to learn to keep the pole upright for a certain length of time, which is 200 time steps. Let's call this the learning goal.
We express this learning goal as follows.
We create a reward function: the agent receives a reward or a punishment after taking an action in a given environment state, and the reward depends only on that state and action. You can think of the reward as a judgment. If an environment state is considered good, we give a reward; if it is considered bad, we give a punishment. Similarly, if a certain action in a given environment state is good, we give a positive reward; if it is bad, we give a negative reward.
The reward function should be such that maximizing the total reward over time is equivalent to reaching the learning goal.
Explain how a reward function is created in CartPole-v0 so that maximizing the reward over time is the same as achieving the learning goal. In the process, explain the second return value, reward, of env.step().
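For reference, a minimal sketch of where this reward shows up, assuming the classic gym API in which env.step() returns (observation, reward, done, info). In CartPole-v0 the reward is +1 on every step, so the total reward in an episode equals the number of time steps before the episode ends, and maximizing it (capped at 200 steps) is exactly the learning goal.

    import gym

    env = gym.make("CartPole-v0")
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    print(reward)   # 1.0 -- one unit of reward per time step the pole is kept up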
Exercise: Calculate the average rewards obtained in CartPole-v0 over 100 episodes if the agent does random actions all the time.
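A sketch of one way to run this exercise (the variable names are placeholders):

    import gym

    env = gym.make("CartPole-v0")
    totals = []
    for _ in range(100):
        env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(env.action_space.sample())  # random action every step
            total += reward
        totals.append(total)
    env.close()
    print(sum(totals) / len(totals))   # average total reward per episode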
Lesson: Other reward functions
Talk about the reward function in MountainCar-v0.
Talk about other reward functions in real-world examples taken from academia and research.
Finally, talk about the dopamine pathway in humans and the pursuit of happiness.
Lesson: Markov Decision Processes
In this course, we will learn to tackle the case where the dynamics of the agent-environment interaction can be expressed as a Markov Decision Process.
Memoryless property, i.e. the transition probabilities to the next state depend only on the current state and action.
The reward is a function of the current state and action.
Write down P and R for CartPole-v0.
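As a reference for the answer (the notation here is an assumption, matching the standard MDP formulation): write the transition kernel as P(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a) and the reward function as R(s, a). In CartPole-v0 the cart-pole physics are deterministic, so P(s' | s, a) is 1 for the single next state s' produced by the dynamics and 0 otherwise, and R(s, a) = 1 on every step until the episode terminates.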
Lesson: Policy
Define a policy as pi(a | s), the probability of taking action a in state s.
Show three policies: the opposite policy (move in the direction opposite to the pole), the random policy, and the epsilon-opposite policy. Write down pi(a | s) in each case. Compute the average total reward per episode for the opposite and random policies in CartPole-v0.
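A sketch of how the opposite and random policies and their evaluation could look, assuming the classic gym API; the helper names are hypothetical, and obs[2] (the pole angle in CartPole-v0) is used to decide the "opposite" direction:

    import gym
    import random

    def opposite_policy(obs):
        # obs[2] is the pole angle; this reading of "opposite" pushes the cart away from
        # the side the pole is leaning towards (swap the branches for the other reading)
        return 0 if obs[2] > 0 else 1

    def random_policy(obs):
        return random.choice([0, 1])

    def average_return(env, policy, episodes=100):
        # average total reward per episode while following the given policy
        totals = []
        for _ in range(episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, info = env.step(policy(obs))
                total += reward
            totals.append(total)
        return sum(totals) / episodes

    env = gym.make("CartPole-v0")
    print(average_return(env, opposite_policy), average_return(env, random_policy))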
Say that the goal is to find a policy that maximizes the total reward over time, and that we will formalize this in the next lesson.
Exercise: Compute the average total reward per episode for the epsilon-opposite policy, with epsilon = 0.9. Where does it rank in terms of average reward compared to the random and opposite policies?
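Assuming epsilon-opposite means: with probability epsilon act uniformly at random, otherwise follow the opposite policy (by analogy with epsilon-greedy), a sketch reusing the hypothetical helpers above:

    def epsilon_opposite_policy(obs, epsilon=0.9):
        # explore with probability epsilon, otherwise exploit the opposite policy
        if random.random() < epsilon:
            return random.choice([0, 1])
        return opposite_policy(obs)

    print(average_return(env, epsilon_opposite_policy))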
Lesson: Value and Q value functions
Define value function.
Talk about various reasons why we use a discount factor.
Define action value function.
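In the same plain notation as pi(a | s) above, with gamma the discount factor and G_t the discounted return from time t, the standard definitions are:

    G_t        = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    V_pi(s)    = E_pi[ G_t | S_t = s ]
    Q_pi(s, a) = E_pi[ G_t | S_t = s, A_t = a ]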
Calculate the value function and the action value function of states while following a random policy in CartPole-v0.
Exercise: Calculate value functions and action value functions of states while following an epsilon-opposite policy.
Lesson: Optimal Policy
Define the ordering of policies.
Talk about the optimality theorem.
Say that our goal is to get to this optimal policy.
Lesson: How humans learn gives us intuition about how to get to the optimal policy
Give an example of human learning involving exploration and exploitation.
Lesson: GLIE Monte Carlo
Discuss the GLIE Monte Carlo algorithm in detail.
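A rough sketch of the algorithm for reference (tabular Q assumed, so CartPole-v0's continuous observation would first have to be discretized; the names below are placeholders, not a fixed implementation):

    from collections import defaultdict
    import random

    def glie_monte_carlo(env, discretize, num_episodes, gamma=1.0):
        Q = defaultdict(float)                  # action value estimates
        N = defaultdict(int)                    # visit counts
        for k in range(1, num_episodes + 1):
            epsilon = 1.0 / k                   # GLIE: epsilon decays to 0 over time
            # generate one episode with the current epsilon-greedy policy
            episode, obs, done = [], env.reset(), False
            while not done:
                s = discretize(obs)
                if random.random() < epsilon:
                    a = env.action_space.sample()
                else:
                    a = max(range(env.action_space.n), key=lambda b: Q[(s, b)])
                obs, r, done, _ = env.step(a)
                episode.append((s, a, r))
            # every-visit Monte Carlo update of Q towards the observed returns
            G = 0.0
            for s, a, r in reversed(episode):
                G = r + gamma * G
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        return Q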
Say that it can be mathematically proven that the epsilon-greedy policy with respect to the current Q values is at least as good as the current epsilon-soft policy (policy improvement for epsilon-soft policies).
Say that it can be mathematically proven that if the GLIE condition is satisfied, then GLIE Monte Carlo converges to the optimal action value function and hence to the optimal policy.
Say that in the next Chapter, we will code up GLIE Monte Carlo and solve the CartPole-v0 env.
Chapter 1
Lesson: OpenAI Gym Installation
Lesson: Jupyter Installation
Lesson: Setting up an RL problem
Exercise: Set up the MountainCar-v0 problem
Lesson: The Agent and its Environment
Exercise: In the MountainCar-v0 environment, print out the observation space. How many elements are there in the observation? Look up the GitHub Wiki and find out what the elements mean.
Lesson: Actions
Exercise: In the MountainCar-v0 environment, print out the action space. How many actions are possible in the environment? For each action, write a loop where that action is taken repeatedly for 30 steps, visualize what happens and try to guess what the action means. Confirm your guess by looking up the environment details in the GitHub Wiki.
Lesson: Rewards and Episodes
Exercise: Calculate the average rewards obtained in CartPole-v0 over 100 episodes if the agent does random actions all the time.
Lesson: Other reward functions
Lesson: Episodes