Closed Marioooooooooooooo closed 1 year ago
Thanks for the thorough review of the code and the paper!
The main reason is to use information from both the current state and the next state to identify the agent's reward and discount, which means the predictor functions can be deterministic, i.e. there is a deterministic mapping from the pair $(s_t, s_{t+1})$ to $r_t$, rather than relating a single state to the reward (see the Reward prediction section in the Dreamer paper).
In practice, this combination of the current stochastic state and the next deterministic state works better than the alternatives (e.g. two stochastic states); in theory they should be equivalent.
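To make the idea concrete, here is a minimal sketch (not the repository's actual code; the dimensions, weights, and function name are hypothetical) of a reward head that conditions on both the next deterministic state $h_{t+1}$ and the current stochastic state $z_t$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the repository.
DETER, STOCH, HIDDEN = 8, 4, 16

# Toy fixed weights for a two-layer head; the real model learns these.
W1 = rng.normal(size=(DETER + STOCH, HIDDEN))
W2 = rng.normal(size=(HIDDEN, 1))

def predict_reward(h_next, z_cur):
    """Predict r_t from the pair (h_{t+1}, z_t).

    Concatenating the next deterministic state with the current
    stochastic state gives the predictor information from both time
    steps, so the mapping to the reward can be deterministic.
    """
    x = np.concatenate([h_next, z_cur], axis=-1)
    return np.tanh(x @ W1) @ W2

h_next = rng.normal(size=DETER)  # plays the role of h_{t+1}
z_cur = rng.normal(size=STOCH)   # plays the role of z_t
r_hat = predict_reward(h_next, z_cur)
print(r_hat.shape)  # (1,)
```

The only point the sketch illustrates is the input: the reward (and, analogously, the discount) head sees the concatenation of states from adjacent time steps rather than a single state.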
It looks like the description for this is missing in the paper, sorry about that!
Thank you very much for your excellent work; I want to ask about something I don't understand. Why are the reward loss and the discount loss at time step t+1 computed using the (t+1)-th deterministic hidden state $h_{t+1}$ and the t-th stochastic state $z_t$? The corresponding code is in loss.py, line 29 and line 30.