gutfeeling / practical_rl_for_coders

Learn reinforcement learning with Python, Gym and Keras.

Big picture analysis #8

Closed gutfeeling closed 4 years ago

gutfeeling commented 4 years ago

Goals at the beginning of the course

  1. Why RL?
    • Underpins much human and animal behavior.
    • A learning framework general enough to be a candidate path to AGI.
  2. Inspiration
    • RL breakthroughs
    • RL in industry
  3. Quick win
    • Programming a random policy in OpenAI Gym for CartPole-v0.
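The quick win in point 3 could look like the following minimal sketch. It assumes the classic Gym API, where `env.reset()` returns an observation and `env.step()` returns a 4-tuple (newer Gymnasium releases changed both signatures):

```python
import gym

# Run one episode of CartPole-v0 with a random policy:
# at every step, sample an action uniformly from the action space.
env = gym.make("CartPole-v0")
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy: ignore the observation
    observation, reward, done, info = env.step(action)
    total_reward += reward
print("Episode return:", total_reward)
env.close()
```

Even this tiny loop already introduces the core RL vocabulary the course needs: environment, observation, action, reward, episode.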

How should the above goals be integrated into the lesson plan?

Goals throughout course

  1. Value based methods vs. policy based methods and when each is used. Typically, value based methods are used when the action space is discrete and small; policy based methods are used when the action space is continuous.
  2. Trading off short term against long term reward, leading to the discount factor. Relation to the credit assignment problem. This is integrated into the discounted reward sum (the optimization goal for value based methods) and the score function (the optimization goal for policy based methods).
  3. Tabular methods vs. function approximation based methods and when each is used. Typically, tabular methods work when the state space is small. As the state space grows, tabular methods take longer to converge, and function approximation approaches are preferred.
  4. Trading off exploration against exploitation, leading to the epsilon parameter and epsilon annealing in value based methods, and to stochastic action selection in policy based methods.
  5. Rewards are automatic in humans/animals, but must be engineered for machines. This is often difficult.
  6. Environment transitions happen automatically in the real world, but the exploration phase in RL can cause damage in real environments. Therefore, simulations must be used in many cases so that exploration can be done safely. In simulations, the environment transitions must also be modeled, which is likewise difficult.
  7. Bias (SARSA(0)) vs. variance (Monte Carlo). Leads to SARSA(lambda) as a compromise between the two.
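The discounted reward sum from point 2 can be sketched as a short helper (the function name is illustrative, not from the course material):

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of gamma**t * r_t over a reward sequence.

    gamma close to 1 weighs long term reward heavily;
    gamma close to 0 makes the agent myopic (short term reward only).
    """
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# With gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

Varying `gamma` here makes the long-vs-short-term trade-off and its connection to credit assignment concrete for students.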