Open YanickZengaffinen opened 5 months ago
This is what the model learns on MDP2 with KLDiv as the objective. Model reward is 0.0, i.e. it predicts the transitions that are actually taken perfectly. Avg Policy Reward on Learned Model is 0.73, which is nowhere close to the 40, but on its own that number doesn't mean much.
More experiments are needed, especially measuring the actual reward of the policy in the real env.
In the GT-MBRL paper they use KLDiv as the objective, but we are currently using MSE.
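To make the difference between the two objectives concrete, here is a minimal sketch in plain Python. It assumes the model outputs a categorical distribution over next states for a given (state, action) pair; the helper names `mse_loss` and `kl_div` and the example distributions are made up for illustration and are not taken from our code or from the paper.

```python
import math

def mse_loss(pred, target):
    """Mean squared error between two next-state distributions."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def kl_div(target, pred, eps=1e-12):
    """KL(target || pred) for categorical distributions.

    Terms where target is 0 contribute nothing; pred is clamped
    to eps so the log stays finite when the model assigns zero mass.
    """
    return sum(
        t * math.log(t / max(p, eps))
        for t, p in zip(target, pred)
        if t > 0
    )

# Hypothetical true and predicted next-state distributions (3 states).
true_next = [0.9, 0.1, 0.0]
model_next = [0.6, 0.3, 0.1]

print("MSE:  ", mse_loss(model_next, true_next))
print("KLDiv:", kl_div(true_next, model_next))
```

Both losses are zero when the prediction matches the target exactly, but they weight errors differently: KLDiv penalizes putting too little mass on transitions that actually occur much more heavily than MSE does, so the two objectives can lead to different learned models.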