When gaps occur between RL decisions, Soar's current behavior (when
temporal-extension is on) is to discount both the rewards collected during
the gap and the Q-value propagated from the next state by the length of
the gap.
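To make that concrete, here is a minimal SARSA-style sketch of gap-based discounting. It is illustrative only, not Soar's actual code; the function `rl_update`, its signature, and the `temporal_discount=False` branch are assumptions about what disabling the discount would mean.

```python
from collections import defaultdict

# Illustrative sketch only, not Soar's implementation. `rl_update`, its
# signature, and the temporal_discount=False behavior are assumptions.
def rl_update(q, s, a, rewards, s_next, a_next, alpha, gamma,
              temporal_discount=True):
    """SARSA-style update where `rewards` holds the reward received on
    each decision cycle of the gap since the last RL action."""
    gap = len(rewards)
    if temporal_discount:
        # Discount each reward by its depth into the gap, and the
        # propagated Q-value by the full gap length.
        ret = sum(gamma ** t * r for t, r in enumerate(rewards))
        target = ret + gamma ** gap * q[(s_next, a_next)]
    else:
        # One plausible reading of "discounting disabled": sum the gap's
        # rewards undiscounted and apply a single discount step, so the
        # update no longer depends on the gap length.
        target = sum(rewards) + gamma * q[(s_next, a_next)]
    q[(s, a)] += alpha * (target - q[(s, a)])

# Example: a 3-cycle gap where the reward arrives on the last cycle.
q = defaultdict(float)
rl_update(q, "s0", "a0", [0.0, 0.0, 1.0], "s1", "a1", alpha=0.1, gamma=0.9)
```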
This all makes sense in the context of Soar, but it can make agents hard
to implement for evaluation purposes: there is no easy way to build a
textbook RL agent whose behavior is independent of how many Soar decisions
occur between RL actions.
I added an rl parameter, temporal-discount, that can be used to disable
discounting based on the number of decisions between RL updates. A patch
is attached.
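Assuming the new parameter follows the conventions of Soar's existing rl command (which uses --set and --get for parameters such as learning and discount-rate), it would presumably be toggled like this:

```
# assumed usage, following the rl command's existing parameter syntax
rl --set temporal-discount off
rl --get temporal-discount
```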
Original issue reported on code.google.com by sam.wint...@gmail.com on 3 Dec 2009 at 4:38