Typically, reinforcement learning must learn a "value function": the expected value of being in state s at time t, accounting for all future rewards. This function must be estimated empirically from experience.
We, however, have a twin that can, in the absence of any policy, perfectly predict future rewards. We've been looking for a place to integrate such a thing, and the value function is it.
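A minimal sketch of the idea, under assumptions not stated in the notes: `twin_rollout` is a hypothetical stand-in for the twin's reward predictor, and the "value" of a state is just the discounted sum of the rewards it predicts, replacing a learned value estimate.

```python
def twin_rollout(state, horizon=5):
    # Hypothetical stand-in for the twin: returns the stream of future
    # rewards it predicts from `state` (here a toy decaying sequence).
    return [state * (0.5 ** t) for t in range(horizon)]

def value_from_twin(state, gamma=0.99, horizon=5):
    """Value of a state computed directly from the twin's predictions:
    V(s) = sum_t gamma^t * r_t, with no learned value function."""
    rewards = twin_rollout(state, horizon)
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

With an undiscounted two-step horizon, `value_from_twin(1.0, gamma=1.0, horizon=2)` is simply 1.0 + 0.5 = 1.5.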
Question: what state vector should the DT be fed?