Closed Marioooooooooooooo closed 1 year ago
Thanks for the thorough review of the code and the paper!
The main reason is to use information from both the current state and the next state to identify the agent's reward and discount, which means the predictor functions can be deterministic, i.e. there is a deterministic mapping from the pair $(s_t, s_{t+1})$ to $r_t$, rather than relating a single state to the reward (see the Reward prediction section in the Dreamer paper).
In practice, this combination of the current stochastic state and the next deterministic state works better than the alternatives (e.g. two stochastic states); in theory they should be equivalent.
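To make the idea concrete, here is a minimal sketch (not the repository's actual code; the dimensions, weights, and function name are hypothetical) of a reward head that conditions on both the next deterministic state $h_{t+1}$ and the current stochastic state $z_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the repository.
DETER, STOCH, HIDDEN = 8, 4, 16

# Toy fixed weights for a two-layer head; the real model learns these.
W1 = rng.normal(size=(DETER + STOCH, HIDDEN))
W2 = rng.normal(size=(HIDDEN, 1))

def predict_reward(h_next, z_cur):
    """Predict r_t from the pair (h_{t+1}, z_t).

    Concatenating the next deterministic state with the current
    stochastic state gives the predictor information from both time
    steps, so the mapping to the reward can be deterministic.
    """
    x = np.concatenate([h_next, z_cur], axis=-1)
    return np.tanh(x @ W1) @ W2

h_next = rng.normal(size=DETER)  # plays the role of h_{t+1}
z_cur = rng.normal(size=STOCH)   # plays the role of z_t
r_hat = predict_reward(h_next, z_cur)
print(r_hat.shape)  # (1,)
```

The only point the sketch illustrates is the input: the reward (and, analogously, the discount) head sees the concatenation of states from adjacent time steps rather than a single state.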
It looks like the description for this is missing in the paper, sorry about that!
Thank you very much for your excellent work; I want to ask about something I don't understand. Why are the reward loss and the discount loss at time step t+1 computed using the (t+1)-th deterministic hidden state $h_{t+1}$ and the t-th stochastic state $z_t$? The corresponding code is in loss.py, line 29 and line 30.