This PR adds one new agent to the control::td package. It is based on the work of Bellemare et al. (https://arxiv.org/pdf/1512.04860.pdf), who develop new operators that prevent the divergence exhibited, in some cases, by the conventional Bellman operator.
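To make the idea concrete, below is a minimal, hypothetical sketch of the sample-based consistent Bellman update described in that paper. The class name, member layout, and method signatures are illustrative assumptions and do not mirror the agent added in this PR; the only substantive point is the modified bootstrap term, which uses Q(s', a) instead of max_b Q(s', b) whenever the transition is a self-loop (s' == s).

```cpp
// Hedged sketch of consistent Q-Learning (Bellemare et al., 2016).
// All names here are illustrative, not the PR's actual API.
#include <algorithm>
#include <cstddef>
#include <vector>

class ConsistentQLearningSketch {
 public:
  ConsistentQLearningSketch(std::size_t nStates, std::size_t nActions,
                            double alpha, double gamma)
      : nActions_(nActions), alpha_(alpha), gamma_(gamma),
        q_(nStates * nActions, 0.0) {}

  // One transition (s, a, r, s'). The "consistent" modification:
  // on a self-loop (s' == s) bootstrap with Q(s', a) rather than
  // max over actions, which widens the action gap.
  void update(std::size_t s, std::size_t a, double r, std::size_t sNext) {
    const double bootstrap = (sNext == s) ? q(sNext, a) : maxQ(sNext);
    const double target = r + gamma_ * bootstrap;
    q(s, a) += alpha_ * (target - q(s, a));
  }

  double& q(std::size_t s, std::size_t a) { return q_[s * nActions_ + a]; }

 private:
  double maxQ(std::size_t s) {
    double best = q(s, 0);
    for (std::size_t a = 1; a < nActions_; ++a) {
      best = std::max(best, q(s, a));
    }
    return best;
  }

  std::size_t nActions_;
  double alpha_, gamma_;
  std::vector<double> q_;
};
```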
One example script is provided; its performance appears to be slightly better than that of conventional Q-Learning.
Note: more variants of this algorithm could be added. We leave these for interested parties to implement, or for a later time when it is more appropriate.