christianadriano opened this issue 2 years ago
Since the agents use the difference between two consecutive steps' utility as the reward, a change in the utility function would directly affect learning. The critic predicts the value of a given state, so its entire prediction could become wrong: all of its weights would have to be adapted (retrained) to predict correctly again after a drift in the utility. Therefore, I'm not sure whether the current implementation can handle such a change.
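A minimal sketch of what I mean (the function names are hypothetical, not the project's actual API): the per-step reward is the utility difference between consecutive steps, and the critic's value estimates are fit to the discounted sum of such rewards, so a drift in the utility invalidates those estimates.

```python
# Illustrative sketch only -- names are hypothetical, not the project's code.

def step_reward(prev_utility: float, curr_utility: float) -> float:
    """Reward = change in system utility from the previous step to the current one."""
    return curr_utility - prev_utility

# The critic learns V(s) ~ expected discounted sum of these rewards.
# If the utility function drifts (becomes non-stationary), the targets
# r_t + gamma * V(s_{t+1}) shift as well, so the critic's weights are
# fit to outdated values and would need to be retrained / adapted.
```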
@christianadriano Where can I find information on how the previous project solved this?
Currently, the utility produced by mRubis is stochastic but stationary. In the future, we might want to relax this stationarity assumption.
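To illustrate the distinction (a toy sketch, not mRubis code): a stationary stochastic utility fluctuates around a fixed mean, while in the non-stationary case the mean itself drifts over time, which is the situation discussed above.

```python
import random

# Toy illustration only -- not the mRubis utility model.

def stationary_utility(step: int, mean: float = 100.0, noise: float = 5.0) -> float:
    # Distribution does not depend on the step: stochastic but stationary.
    return random.gauss(mean, noise)

def drifting_utility(step: int, mean: float = 100.0, noise: float = 5.0,
                     drift_per_step: float = 0.1) -> float:
    # Mean shifts with the step: stochastic and non-stationary (utility drift).
    return random.gauss(mean + drift_per_step * step, noise)
```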