jbr-ai-labs / mamba

This code accompanies the paper "Scalable Multi-Agent Model-Based Reinforcement Learning".
MIT License

reward loss #7

Closed Marioooooooooooooo closed 1 year ago

Marioooooooooooooo commented 1 year ago

Thank you very much for your excellent work. I would like to ask about something I don't understand: why are the reward loss and the discount loss of the (t+1)-th time step calculated using the (t+1)-th deter (hidden state $h_{t+1}$) and the t-th stoch (namely, $z_t$)? The corresponding code is in loss.py, lines 29 and 30.
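For readers without the repository open, here is a minimal sketch of the kind of computation being asked about: the reward and discount heads for step t+1 are fed the next deterministic state $h_{t+1}$ concatenated with the current stochastic state $z_t$. All names (`deter`, `stoch`, `reward_model`, `discount_model`) are illustrative assumptions and do not correspond to the repository's actual code in loss.py.

```python
import torch

def reward_discount_loss(deter, stoch, rewards, discounts,
                         reward_model, discount_model):
    # deter:  (T, B, deter_dim)  deterministic states h_1..h_T
    # stoch:  (T, B, stoch_dim)  stochastic states    z_1..z_T
    # rewards, discounts: (T, B) targets aligned so index t+1 matches deter[1:]
    feat = torch.cat([deter[1:], stoch[:-1]], dim=-1)  # pair (h_{t+1}, z_t)
    reward_dist = reward_model(feat)      # assumed to return e.g. a Normal
    discount_dist = discount_model(feat)  # assumed to return e.g. a Bernoulli
    reward_loss = -reward_dist.log_prob(rewards[1:]).mean()
    discount_loss = -discount_dist.log_prob(discounts[1:]).mean()
    return reward_loss, discount_loss
```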

vladimirrim commented 1 year ago

Thanks for the thorough review of the code and the paper!

The main reason is to give the predictors information from both the current state and the next one when identifying the agent's reward and discount. This means the predictor functions can be deterministic, i.e. there is a bijection between $(s_t, s_{t+1})$ and $r_t$, as opposed to simply mapping a single state to the reward (see the Reward prediction section in the Dreamer paper).

In practice, this combination of the current stochastic state and the next deterministic state works better than the alternatives (e.g. two stochastic states); in theory they should be equivalent.
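As a rough illustration of the alternatives being compared, assuming the tensors from the sketch above, the two pairings could be formed like this (the choice discussed in this thread is the mixed pairing):

```python
# Illustrative only: two possible inputs for the reward/discount heads.
mixed_pair = torch.cat([deter[1:], stoch[:-1]], dim=-1)  # (h_{t+1}, z_t), the pairing discussed here
stoch_pair = torch.cat([stoch[1:], stoch[:-1]], dim=-1)  # (z_{t+1}, z_t), an alternative mentioned above
```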

It looks like the description of this detail is missing from the paper, sorry about that!