Closed: szrlee closed this issue 1 year ago
Hello!
Indeed, one of our goals with rlberry is to provide an interface for agents that is simple enough for us to compare different kinds of algorithms: those with theoretical guarantees, deep RL, MCTS, etc., so your contribution would be very welcome! For instance, the script examples/demo_agent_stats.py shows how we can compare PPO to an algorithm with theoretical guarantees (RS-KernelUCBVI).
Remark: In our current implementations of theoretically grounded algorithms (UCBVI, OptQL, AdaptiveQL, RSKernelUCBVI, LSVI-UCB), we often need some tricks to make them more "practical" (i.e., not have linear regret for a very long time), like adapting the constants in front of the exploration bonuses (which are often too conservative in theory). The bonuses described in the UCBMQ paper often work very well in the tabular setting, and we used a similar strategy to define the bonuses in the LSVI-UCB implementation.
Does the practical bonus term still hold theoretical guarantees? Or is it possible to use the practical bonus term to prove something meaningful in future research?
In general, we lose the theoretical guarantees, except in simple cases such as deterministic environments, where it is enough to visit each state at least once. The "practical" bonuses that we use are such that if n(s, a) = 0, the bonus is maximal (= H, the horizon), which ensures that the agent visits each state at least once. Then, for stochastic environments, we can play with the constant multiplying the 1/sqrt(n) term to control the amount of exploration.
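To make the idea concrete, here is a minimal sketch of such a "practical" bonus (the function name and the `scale` parameter are illustrative assumptions, not rlberry's actual code):

```python
import numpy as np

def practical_bonus(n_visits, horizon, scale=0.1):
    """Sketch of a "practical" exploration bonus.

    An unvisited (s, a) pair gets the maximal bonus H (the horizon),
    forcing the agent to try it at least once. Visited pairs get a
    scaled 1/sqrt(n) term, where `scale` replaces the over-conservative
    constant that appears in the theoretical analysis.
    """
    if n_visits == 0:
        return float(horizon)
    return min(scale * horizon / np.sqrt(n_visits), horizon)
```

Tuning `scale` then directly controls the amount of exploration in stochastic environments.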
To avoid over-conservative bonuses in theory, we can use tighter concentration inequalities (e.g., UCRL3 https://hal.archives-ouvertes.fr/hal-03000664/document) or try to develop algorithms that adapt to the structure of each MDP. Also, PSRL-like algorithms avoid these issues by not requiring bonuses at all.
A possibility would be to have a parameter that is either "theoretical" or "heuristic": the heuristic mode corresponding to the practical bonus used in applications, and the theoretical mode for research, where we can use a huge horizon with simplistic environments (e.g., very small ones) but at least keep the theoretical guarantees.
I think it would make sense because rlberry is meant, at least in part, to be used for research.
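Such a switch could look roughly like the following sketch (the function name, the Hoeffding-style formula, and the constants are assumptions for illustration, not an existing rlberry API):

```python
import numpy as np

def make_bonus(bonus_type, horizon, delta=0.05):
    """Return a bonus function b(n) for a given mode.

    "theoretical": conservative Hoeffding-style constant from the analysis.
    "heuristic": scaled-down constant tuned for practical performance.
    Both modes give the maximal bonus H when n = 0, so every state-action
    pair is visited at least once.
    """
    if bonus_type == "theoretical":
        return lambda n: float(horizon) if n == 0 else horizon * np.sqrt(np.log(1.0 / delta) / n)
    if bonus_type == "heuristic":
        return lambda n: float(horizon) if n == 0 else min(0.1 * horizon / np.sqrt(n), float(horizon))
    raise ValueError(f"unknown bonus_type: {bonus_type}")
```

The theoretical mode keeps the regret bound valid; the heuristic mode trades the guarantee for much faster learning in practice.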
Is your feature request related to a problem? Please describe.
Feature request: Is there a plan for adding more algorithms with theoretical guarantees, either in the tabular setting or with linear function approximation with known features?
Describe the solution you'd like
Tabular algorithms with regret bounds, e.g.
Exploration with linear function approximation and known features, e.g.
Describe alternatives you've considered
Is the rlberry framework suitable for implementing these algorithms? If so, I am willing to contribute implementations of the above algorithms in this framework after getting familiar with rlberry.