Closed: szrlee closed this issue 1 year ago
Hello!
Indeed, one of our goals with rlberry is to provide an interface for agents that is simple enough for us to compare different kinds of algorithms: those with theoretical guarantees, deep RL, MCTS, etc., so your contribution would be very welcome! For instance, the script examples/demo_agent_stats.py shows how we can compare PPO to an algorithm with theoretical guarantees (RS-KernelUCBVI).
Remark: In our current implementations of theoretically grounded algorithms (UCBVI, OptQL, AdaptiveQL, RSKernelUCBVI, LSVI-UCB), we often need some tricks to make them more "practical" (i.e., not have linear regret for a very long time), like adapting the constants in front of the exploration bonuses (which are often too conservative in theory). The bonuses described in the UCBMQ paper often work very well in the tabular setting, and we used a similar strategy to define the bonuses in the LSVI-UCB implementation.
Does the practical bonus term still hold theoretical guarantees? Or is it possible to use the practical bonus term to prove something meaningful in future research?
In general, we lose the theoretical guarantees, except in simple cases such as deterministic environments, where it is enough to visit each state at least once. The "practical" bonuses that we use are such that if n(s, a) = 0, the bonus is maximal (= H, the horizon), which ensures that the agent visits each state at least once. Then, for stochastic environments, we can play with the constant multiplying the 1/sqrt(n) term to control the amount of exploration.
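To make the idea concrete, here is a minimal sketch of such a "practical" bonus (the function name and the `scale` parameter are illustrative assumptions, not rlberry's actual code):

```python
import numpy as np

def practical_bonus(n_visits, horizon, scale=0.1):
    """Sketch of a "practical" exploration bonus.

    An unvisited (s, a) pair gets the maximal bonus H (the horizon),
    forcing the agent to try it at least once. Visited pairs get a
    scaled 1/sqrt(n) term, where `scale` replaces the over-conservative
    constant that appears in the theoretical analysis.
    """
    if n_visits == 0:
        return float(horizon)
    return min(scale * horizon / np.sqrt(n_visits), horizon)
```

Tuning `scale` then directly controls the amount of exploration in stochastic environments.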
To avoid over-conservative bonuses in theory, we can use tighter concentration inequalities (e.g., UCRL3 https://hal.archives-ouvertes.fr/hal-03000664/document) or try to develop algorithms that adapt to the structure of each MDP. Also, PSRL-like algorithms avoid these issues by not requiring bonuses at all.
A possibility would be to have a parameter that is either "theoretical" or "heuristic": the heuristic mode corresponding to the practical bonus used in applications, and the theoretical mode for research, where we can use a huge horizon with simplistic environments (e.g., very small ones) but at least keep the theoretical guarantees.
I think it would make sense because rlberry is meant, at least in part, to be used for research.
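Such a switch could look roughly like the following sketch (the function name, the Hoeffding-style formula, and the constants are assumptions for illustration, not an existing rlberry API):

```python
import numpy as np

def make_bonus(bonus_type, horizon, delta=0.05):
    """Return a bonus function b(n) for a given mode.

    "theoretical": conservative Hoeffding-style constant from the analysis.
    "heuristic": scaled-down constant tuned for practical performance.
    Both modes give the maximal bonus H when n = 0, so every state-action
    pair is visited at least once.
    """
    if bonus_type == "theoretical":
        return lambda n: float(horizon) if n == 0 else horizon * np.sqrt(np.log(1.0 / delta) / n)
    if bonus_type == "heuristic":
        return lambda n: float(horizon) if n == 0 else min(0.1 * horizon / np.sqrt(n), float(horizon))
    raise ValueError(f"unknown bonus_type: {bonus_type}")
```

The theoretical mode keeps the regret bound valid; the heuristic mode trades the guarantee for much faster learning in practice.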
Is your feature request related to a problem? Please describe.
Feature request: Is there a plan for adding more algorithms with theoretical guarantees, either in the tabular setting or with linear function approximation with known features?
Describe the solution you'd like
Tabular algorithms with regret bounds, e.g.
Exploration with linear function approximation and known features, e.g.
Describe alternatives you've considered
Is the rlberry framework suitable for implementing these algorithms? If so, I am willing to contribute implementations of the above algorithms in this framework after getting familiar with rlberry.