Open YanickZengaffinen opened 2 months ago
This is what is learned on MDP2 with KLDiv as the objective.
Model reward is 0.0 => it's perfect at predicting the transitions that are taken.
Avg Policy Reward on Learned Model: 0.73 => it's nowhere close to the 40 but that doesn't mean much.
More experiments are needed (especially: the actual reward of the policy in the real env).
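A rough sketch of what that comparison could look like: roll the same policy out on the learned model and on the real env and compare the average returns. Everything here is a hypothetical stand-in (`average_return`, `real_step`, `model_step`, the toy rewards), not the project's actual environment or API.

```python
def average_return(policy, step, start_state, horizon, episodes):
    """Mean undiscounted return of `policy` over `episodes` rollouts."""
    total = 0.0
    for _ in range(episodes):
        state = start_state
        for _ in range(horizon):
            action = policy(state)
            state, reward, done = step(state, action)
            total += reward
            if done:
                break
    return total / episodes


# Hypothetical toy dynamics: a 3-step chain that ends at state 3.
def real_step(state, action):
    reward = 1.0 if action == 1 else 0.0
    next_state = state + 1
    return next_state, reward, next_state >= 3


# Hypothetical learned model: same transitions, but it overestimates reward.
def model_step(state, action):
    reward = 2.0 if action == 1 else 0.0
    next_state = state + 1
    return next_state, reward, next_state >= 3


def policy(state):
    # Always take action 1 (assumed fixed policy for illustration).
    return 1


model_return = average_return(policy, model_step, 0, horizon=10, episodes=5)
real_return = average_return(policy, real_step, 0, horizon=10, episodes=5)
print("return on learned model:", model_return)
print("return in real env:", real_return)
```

If the two numbers diverge, the policy is likely exploiting model errors rather than doing well in the real env.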
In the GT-MBRL paper they use KLDiv as the objective, but we are currently using MSE.
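To make the difference between the two objectives concrete, here is a minimal sketch in plain Python, assuming the model outputs a discrete distribution over next states. The direction KL(target || predicted), the helper names, and the example distributions are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete next-state distributions."""
    return sum(
        t * math.log(t / max(p, eps))  # eps guards against log of zero
        for t, p in zip(target, predicted)
        if t > 0.0  # terms with t == 0 contribute nothing
    )


def mse(target, predicted):
    """Mean squared error between the two probability vectors."""
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


# Hypothetical true next-state distribution and two model predictions.
target = [0.9, 0.1, 0.0]
close = [0.8, 0.2, 0.0]
far = [0.1, 0.9, 0.0]

for name, pred in [("exact", target), ("close", close), ("far", far)]:
    print(name, "KL:", kl_divergence(target, pred), "MSE:", mse(target, pred))
```

Both losses are zero for an exact prediction, but KL penalizes putting low probability on a likely transition much more heavily than MSE does.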