Exercise: In the MountainCar-v0 environment, print out the observation space. How many elements are there in the observation? Look up the GitHub Wiki and find out what the elements mean.
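A minimal sketch of this exercise, assuming the classic gym API where env.reset() returns an observation:

    import gym

    env = gym.make("MountainCar-v0")
    print(env.observation_space)   # a Box space; its shape tells you how many elements an observation has
    print(env.reset())             # one sample observation from the environment
    env.close()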
Exercise: In the MountainCar-v0 environment, print out the action space. How many actions are possible in the environment? For each action, write a loop where that action is taken repeatedly for 30 steps, visualize what happens and try to guess what the action means. Confirm your guess by looking up the environment details in the GitHub Wiki.
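A sketch of the repeat-one-action loop, again assuming the classic gym API with env.render() for visualization:

    import gym

    env = gym.make("MountainCar-v0")
    print(env.action_space)                 # a Discrete space; .n gives the number of possible actions
    for action in range(env.action_space.n):
        env.reset()
        for _ in range(30):
            env.render()                    # watch what repeating this action does to the car
            env.step(action)
    env.close()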
Lesson: Rewards and Episodes
In CartPole-v0, we want the agent to learn to keep the pole upright for a certain length of time, which is 200 time steps. Let's call this the learning goal.
We express this learning goal as follows.
We create a reward function: the agent receives a reward or a punishment after taking an action in a given environment state, and the reward depends only on that state and action. You can think of the reward as a judgment. If an environment state is considered good, we give a reward; if it is considered bad, we give a punishment. Similarly, if a certain action in a given environment state is good, we give a positive reward; if it is bad, we give a negative reward.
The reward function should be such that maximizing the total reward over time is equivalent to reaching the learning goal.
Explain how a reward function is created in CartPole-v0 so that maximizing the reward over time is the same as achieving the learning goal. In the process, explain the second return value, reward, of env.step().
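For reference, a minimal sketch of where this reward shows up, assuming the classic gym API in which env.step() returns (observation, reward, done, info). In CartPole-v0 the reward is +1 on every step, so the total reward in an episode equals the number of time steps before the episode ends, and maximizing it (capped at 200 steps) is exactly the learning goal.

    import gym

    env = gym.make("CartPole-v0")
    obs = env.reset()
    obs, reward, done, info = env.step(env.action_space.sample())
    print(reward)   # 1.0 -- one unit of reward per time step the pole is kept up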
Exercise: Calculate the average rewards obtained in CartPole-v0 over 100 episodes if the agent does random actions all the time.
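A sketch of one way to run this exercise (the variable names are placeholders):

    import gym

    env = gym.make("CartPole-v0")
    totals = []
    for _ in range(100):
        env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, info = env.step(env.action_space.sample())  # random action every step
            total += reward
        totals.append(total)
    env.close()
    print(sum(totals) / len(totals))   # average total reward per episode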
Lesson: Other reward functions
Talk about the reward function in MountainCar-v0.
Talk about other reward functions in real-world examples taken from academia and research.
Finally, talk about the dopamine pathway in humans and the pursuit of happiness.
Lesson: Markov Decision Processes
In this course, we will learn to tackle the case where the dynamics of the agent-environment interaction can be expressed as a Markov Decision Process.
Memoryless property, i.e. the transition probabilities to the next state depend only on the current state and action.
The reward is a function of the current state and action.
Write down P and R for CartPole-v0.
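As a reference for the answer (the notation here is an assumption, matching the standard MDP formulation): write the transition kernel as P(s' | s, a) = Pr(S_{t+1} = s' | S_t = s, A_t = a) and the reward function as R(s, a). In CartPole-v0 the cart-pole physics are deterministic, so P(s' | s, a) is 1 for the single next state s' produced by the dynamics and 0 otherwise, and R(s, a) = 1 on every step until the episode terminates.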
Lesson: Policy
Define a policy as pi(a | s), the probability of taking action a in state s.
Show three policies: the opposite policy (move in the direction opposite to the pole), the random policy, and the epsilon-opposite policy. Write down pi(a | s) in each case. Compute the average total reward per episode for the opposite and random policies in CartPole-v0.
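A sketch of how the opposite and random policies and their evaluation could look, assuming the classic gym API; the helper names are hypothetical, and obs[2] (the pole angle in CartPole-v0) is used to decide the "opposite" direction:

    import gym
    import random

    def opposite_policy(obs):
        # obs[2] is the pole angle; this reading of "opposite" pushes the cart away from
        # the side the pole is leaning towards (swap the branches for the other reading)
        return 0 if obs[2] > 0 else 1

    def random_policy(obs):
        return random.choice([0, 1])

    def average_return(env, policy, episodes=100):
        # average total reward per episode while following the given policy
        totals = []
        for _ in range(episodes):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                obs, reward, done, info = env.step(policy(obs))
                total += reward
            totals.append(total)
        return sum(totals) / episodes

    env = gym.make("CartPole-v0")
    print(average_return(env, opposite_policy), average_return(env, random_policy))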
Say that the goal is to find a policy that maximizes the total reward over time, and that we will formalize this in the next lesson.
Exercise: Compute the average total reward per episode for the epsilon-opposite policy, with epsilon = 0.9. Where does it rank in terms of average reward compared to the random and opposite policies?
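Assuming epsilon-opposite means: with probability epsilon act uniformly at random, otherwise follow the opposite policy (by analogy with epsilon-greedy), a sketch reusing the hypothetical helpers above:

    def epsilon_opposite_policy(obs, epsilon=0.9):
        # explore with probability epsilon, otherwise exploit the opposite policy
        if random.random() < epsilon:
            return random.choice([0, 1])
        return opposite_policy(obs)

    print(average_return(env, epsilon_opposite_policy))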
Lesson: Value and Q value functions
Define value function.
Talk about various reasons why we use a discount factor.
Define action value function.
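In the same plain notation as pi(a | s) above, with gamma the discount factor and G_t the discounted return from time t, the standard definitions are:

    G_t        = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    V_pi(s)    = E_pi[ G_t | S_t = s ]
    Q_pi(s, a) = E_pi[ G_t | S_t = s, A_t = a ]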
Calculate the value function and the action value function of states while following a random policy in CartPole-v0.
Exercise: Calculate value functions and action value functions of states while following an epsilon-opposite policy.
Lesson: Optimal Policy
Define the ordering of policies.
Talk about the optimality theorem.
Say that our goal is to get to this optimal policy.
Lesson: How humans learn gives us intuition about how to get to the optimal policy
Give an example of human learning involving exploration and exploitation.
Lesson: GLIE Monte Carlo
Discuss the GLIE Monte Carlo algorithm in detail.
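A rough sketch of the algorithm for reference (tabular Q assumed, so CartPole-v0's continuous observation would first have to be discretized; the names below are placeholders, not a fixed implementation):

    from collections import defaultdict
    import random

    def glie_monte_carlo(env, discretize, num_episodes, gamma=1.0):
        Q = defaultdict(float)                  # action value estimates
        N = defaultdict(int)                    # visit counts
        for k in range(1, num_episodes + 1):
            epsilon = 1.0 / k                   # GLIE: epsilon decays to 0 over time
            # generate one episode with the current epsilon-greedy policy
            episode, obs, done = [], env.reset(), False
            while not done:
                s = discretize(obs)
                if random.random() < epsilon:
                    a = env.action_space.sample()
                else:
                    a = max(range(env.action_space.n), key=lambda b: Q[(s, b)])
                obs, r, done, _ = env.step(a)
                episode.append((s, a, r))
            # every-visit Monte Carlo update of Q towards the observed returns
            G = 0.0
            for s, a, r in reversed(episode):
                G = r + gamma * G
                N[(s, a)] += 1
                Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
        return Q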
Say that it can be mathematically proven that the epsilon-greedy policy with respect to the current Q values is at least as good as the current epsilon-soft policy (policy improvement for epsilon-soft policies).
Say that it can be mathematically proven that if the GLIE condition is satisfied, then GLIE Monte Carlo converges to the optimal action value function and hence to the optimal policy.
Say that in the next Chapter, we will code up GLIE Monte Carlo and solve the CartPole-v0 env.
Chapter 1
Lesson: OpenAI Gym Installation
Lesson: Jupyter Installation
Lesson: Setting up an RL problem
Exercise: Set up the MountainCar-v0 problem
Lesson: The Agent and its Environment
Exercise: In the MountainCar-v0 environment, print out the observation space. How many elements are there in the observation? Look up the GitHub Wiki and find out what the elements mean.
Lesson: Actions
Exercise: In the MountainCar-v0 environment, print out the action space. How many actions are possible in the environment? For each action, write a loop where that action is taken repeatedly for 30 steps, visualize what happens and try to guess what the action means. Confirm your guess by looking up the environment details in the GitHub Wiki.
Lesson: Rewards and Episodes
Exercise: Calculate the average rewards obtained in CartPole-v0 over 100 episodes if the agent does random actions all the time.
Lesson: Other reward functions
Lesson: Episodes