There is also a weird thing in `ExactRL` right now (when we merge the real sarsa/qlearning split, cf. issue #61). The interface requires a function to select the best action (off-policy) and a function to select an action (on-policy). First, this is weird because the best action needs to be implemented even for SARSA, which does not use it. Perhaps not so bad, because the on-policy `chooseAction` can call `bestAction`. In the light of this present issue, it is still a bit strange, though. Probably only `bestAction` should be implemented for a problem, while `chooseAction` should use the generalized exploration function discussed above and be renamed to something like `onPolicyAction`. The different algorithms could then just mix in different policies, as in the sketch below.
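A minimal sketch of what that split could look like, assuming a trait-based design. The names (`Agent`, `EpsilonGreedy`, the `Map`-based Q-table) are illustrative, not the actual `ExactRL` interface:

```scala
// Hypothetical split: the problem/algorithm implements only bestAction,
// and the exploration policy is mixed in as a separate trait.
trait Agent[State, Action] {
  // Greedy (off-policy) choice, the only thing a problem must provide.
  def bestAction(q: Map[(State, Action), Double], s: State): Action
  def actions(s: State): List[Action]
}

// One possible mixed-in policy: epsilon-greedy exploration.
trait EpsilonGreedy[State, Action] extends Agent[State, Action] {
  def epsilon: Double

  def onPolicyAction(q: Map[(State, Action), Double], s: State): Action = {
    val as = actions(s)
    if (scala.util.Random.nextDouble() < epsilon)
      as(scala.util.Random.nextInt(as.size))   // explore
    else
      bestAction(q, s)                         // exploit
  }
}
```

With such a split, SARSA and Q-learning would differ only in whether their update uses the action actually taken via `onPolicyAction` or the greedy `bestAction` for the successor state, and swapping the exploration policy would not touch the problem definition.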
Strangely, I do not recall right now whether Sutton & Barto contains algorithms that differ only in their policy (for instance proportional vs. best). It seems that this is problem dependent rather than algorithm dependent. We also do not presently have any tests that separate different policies.
Russel & Norvig p. 842, Fig 21.8 (p.844), all refs to 3rd edition, use a generalized exploration function, which allows for the agent to decrease or stop exploration over time. They define a function f (u,n), where u is the current estimated utility (reward) for a state and n is the number of visits to a considered state. A simple cut-off function is given as an example on p. 842, where one returns the reward estimate for all states visited more than treshold Ne times, and for states not visited sufficiently much, one returns an upper-bound on reward values (to encourage exploration of these states).
Implementing such a device, instead of a fixed exploration ratio, would allow agents to explore in more dynamic ways.
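A minimal sketch of the cut-off exploration function, assuming the visit count is passed in explicitly. `Ne` and `rPlus` (the optimistic upper bound) are the problem-dependent parameters from the book; the class name is made up:

```scala
// Cut-off exploration function in the style of Russell & Norvig, p. 842:
// below Ne visits return the optimistic bound rPlus, otherwise the estimate u.
case class CutoffExploration(Ne: Int, rPlus: Double) {
  def f(u: Double, n: Int): Double =
    if (n < Ne) rPlus else u
}
```

Action selection would then maximize `f` applied to the value estimate and the visit count, instead of the estimate alone, so insufficiently visited choices look maximally attractive until they have been tried `Ne` times.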
It appears that if we add such a facility, the counting of visits to states should likely also be confined to the same type/trait/device.
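A possible shape for that, again with made-up names, counting visits per state as described above (per state-action counts would look the same):

```scala
// Keep the visit counts inside the same trait as the exploration function,
// so the counting concern does not leak into the problem or the algorithm.
trait VisitCounting[State] {
  private val visits =
    scala.collection.mutable.Map.empty[State, Int].withDefaultValue(0)

  def recordVisit(s: State): Unit = visits(s) += 1
  def visitCount(s: State): Int = visits(s)
}
```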