
2003-Shaping and policy search in Reinforcement Learning-Andrew Y. Ng #2

jiegenghua commented 5 years ago

This is the Ph.D. thesis of Andrew Y. Ng.

Choosing the reward function is hard. This dissertation gives a theory of reward shaping, which in turn gives guidelines for selecting good shaping rewards that in practice yield significant speedups of the learning process. It also proposes the PEGASUS policy search method and applies it to autonomous helicopter control.

PEGASUS: a policy search method for large MDPs and POMDPs. Problems that make helicopter control hard: (1) delayed consequences; (2) partial observability.

Some of the issues that make certain reinforcement learning problems challenging: (1) high dimensionality; (2) how to choose the reward function; (3) partial observability; (4) difficulty reusing the collected data.

Reward shaping refers to the practice of choosing or modifying a reward function to help learning algorithms learn faster.

RL and (PO)MDPs. (PS: I did not know the reward function could be defined in so many different forms.)
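
For reference, a minimal restatement of the MDP formalism from this part of the thesis; the notation below is mine and may differ slightly from the screenshots:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
An MDP is a tuple $M = (S, A, \{P_{sa}\}, \gamma, R)$: states $S$, actions $A$,
next-state distributions $P_{sa}(\cdot)$, discount factor $\gamma \in [0,1)$,
and reward function $R$, which may be written as $R(s)$, $R(s,a)$, or $R(s,a,s')$.
A policy $\pi : S \to A$ is evaluated by its expected discounted return
\[
V^{\pi}(s_0) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)
\;\middle|\; a_t = \pi(s_t) \right].
\]
\end{document}
```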

The relationship between these quantities (and a similar one) is given by equations in the thesis.

(PS: for POMDPs, the corresponding equation may not be right.)

Curse of dimensionality: the number of states grows exponentially with the dimension of the state space (for example, discretizing each of d state variables into k values gives k^d states).

The advantage compared to the REINFORCE algorithm: REINFORCE uses data sampled from the MDP once to take a small uphill step and then throws the data away. PEGASUS, by contrast, reuses the data obtained from the MDP. This allows more efficient algorithms to be derived, for which nontrivial performance guarantees can be proved.

Finding good solutions to POMDPs is hard, and policy search based methods can work. Policy search methods generalize more easily than dynamic programming or value-function-based methods to the POMDP setting.

Shaping in RL

Reward shaping refers to the practice of modifying the reward function to provide guidance, or give "hints", to a learning agent to help it learn faster. The aim is for the optimal policies to be invariant to the changes we make to the reward function, so that good policies learned using shaping rewards are still good for the original problem. In other words, the theory provides systematic ways of modifying a reward function while guaranteeing that the resulting learned policies will still be good.

The shaping reward function F : S × A × S → ℝ is a bounded real-valued function; the shaped MDP M' = (S, A, {P_sa}, γ, R + F) uses reward R + F in place of R.

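The images here contained the formal shaping definitions and theorems. The central policy-invariance result, as stated in the thesis (and the earlier shaping paper by Ng, Harada, and Russell), is, in my notation:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
Given an MDP $M = (S, A, \{P_{sa}\}, \gamma, R)$ and a shaping reward function
$F : S \times A \times S \to \mathbb{R}$, the shaped MDP is
$M' = (S, A, \{P_{sa}\}, \gamma, R + F)$.
If $F$ is \emph{potential-based}, i.e.\ there exists $\Phi : S \to \mathbb{R}$ with
\[
F(s, a, s') = \gamma\,\Phi(s') - \Phi(s),
\]
then a policy is optimal in $M'$ if and only if it is optimal in $M$;
the optimal $Q$-functions are related by
$Q^{*}_{M'}(s, a) = Q^{*}_{M}(s, a) - \Phi(s)$.
\end{document}
```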

(PS: for detailed proof, please read the original paper)

PEGASUS-----Policy Evaluation of Goodness And Search Using Scenarios

The advantages of policy search methods compared to dynamic-programming and value-function-based solutions: for many MDPs, the value and Q-functions can be complicated and difficult to approximate. A compact, accurate Q-function yields a good policy, but there is no guarantee that the existence of a good policy implies a simple Q-function. The disadvantages are that policy search may be prone to local optima and can be computationally expensive.
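
A minimal sketch of the core idea behind PEGASUS: evaluate every candidate policy on the same fixed set of "scenarios" (pre-drawn initial states and random numbers), so the value estimate is a deterministic function of the policy and the sampled data are reused. The toy simulator, policy class, and all names below are my own:

```python
import numpy as np

def pegasus_value(policy, simulate_step, init_states, scenarios, horizon, gamma):
    """PEGASUS-style policy evaluation: average the discounted return over a
    fixed set of scenarios (pre-drawn random numbers), so the estimate is a
    deterministic function of the policy and the same data can be reused."""
    total = 0.0
    for s0, noise in zip(init_states, scenarios):
        s, ret, discount = s0, 0.0, 1.0
        for t in range(horizon):
            a = policy(s)
            # The simulator consumes the pre-drawn random number noise[t]
            # instead of drawing a fresh one, removing evaluation noise.
            s, r = simulate_step(s, a, noise[t])
            ret += discount * r
            discount *= gamma
        total += ret
    return total / len(init_states)

# Toy problem (my own): a 1-D point that should be driven toward the origin.
def simulate_step(s, a, u):              # u is the scenario's random number in [0, 1)
    s_next = s + a + 0.1 * (u - 0.5)     # deterministic given (s, a, u)
    return s_next, -abs(s_next)          # reward: stay close to 0

rng = np.random.default_rng(0)
m, horizon = 100, 20
init_states = rng.uniform(-1.0, 1.0, size=m)
scenarios = rng.uniform(0.0, 1.0, size=(m, horizon))

# Tiny policy class pi_k(s) = -k * s; search k by evaluating on the same scenarios.
best_k = max(np.linspace(0.0, 1.0, 21),
             key=lambda k: pegasus_value(lambda s: -k * s, simulate_step,
                                         init_states, scenarios, horizon, gamma=0.95))
print("best gain k:", best_k)
```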

The VC dimension for RL (similar to the concept in supervised learning).

The trajectory tree method: how is a trajectory tree built? The thesis walks through an example.
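
A minimal sketch of the trajectory tree idea: each node stores a state and, for every action, one sampled reward and successor from a generative model; any deterministic policy can then be evaluated by walking the tree, and averaging over many independently built trees estimates its value. The toy generative model and all names are my own:

```python
import numpy as np

class Node:
    """One node of a trajectory tree: a state plus, for every action,
    the sampled reward and child node obtained from the generative model."""
    def __init__(self, state):
        self.state = state
        self.children = {}   # action -> (reward, child Node)

def build_tree(gen_model, state, actions, depth, rng):
    """Recursively sample one successor per action down to the given depth."""
    node = Node(state)
    if depth == 0:
        return node
    for a in actions:
        s_next, r = gen_model(state, a, rng)
        node.children[a] = (r, build_tree(gen_model, s_next, actions, depth - 1, rng))
    return node

def evaluate(tree, policy, gamma):
    """Discounted return of a deterministic policy on a single tree."""
    ret, discount, node = 0.0, 1.0, tree
    while node.children:
        r, child = node.children[policy(node.state)]
        ret += discount * r
        discount *= gamma
        node = child
    return ret

# Toy generative model (my own): a random walk on the integers, rewarded for sitting at 0.
def gen_model(s, a, rng):
    s_next = s + a + rng.integers(-1, 2)
    return s_next, 1.0 if s_next == 0 else 0.0

rng = np.random.default_rng(0)
trees = [build_tree(gen_model, 0, actions=(-1, 0, 1), depth=6, rng=rng) for _ in range(50)]
policy = lambda s: -int(np.sign(s))          # push back toward 0
v_hat = np.mean([evaluate(t, policy, gamma=0.9) for t in trees])
print("estimated value of the policy:", v_hat)
```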

Uniform convergence occurs when the policy class has low VC dimension.
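
Schematically (omitting the exact constants, which are in the thesis), the guarantee has the form:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
For a policy class $\Pi$ with finite VC dimension, with probability at least $1 - \delta$,
\[
\sup_{\pi \in \Pi} \left| \hat{V}(\pi) - V(\pi) \right| \le \epsilon ,
\]
provided the number of sampled trees/scenarios $m$ is polynomial in
$\mathrm{VC}(\Pi)$, $1/\epsilon$, $\log(1/\delta)$, and the horizon.
\end{document}
```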

Autonomous helicopter flight via RL

The state vector:
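
As best I can reconstruct from the figures (and the related helicopter papers), it is the standard 12-dimensional rigid-body state:

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
\[
s = \bigl(\underbrace{x, y, z}_{\text{position}},\;
          \underbrace{\phi, \theta, \omega}_{\text{roll, pitch, yaw}},\;
          \underbrace{\dot{x}, \dot{y}, \dot{z}}_{\text{velocity}},\;
          \underbrace{\dot{\phi}, \dot{\theta}, \dot{\omega}}_{\text{angular rates}}
    \bigr) \in \mathbb{R}^{12}.
\]
\end{document}
```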
