nilscrm / stackelberg-ml


KLDiv instead of MSE #27

Open YanickZengaffinen opened 2 months ago

YanickZengaffinen commented 2 months ago

The GT-MBRL paper uses KL divergence (KLDiv) as the objective, but we are currently using MSE.
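To make the difference concrete, here is a minimal sketch comparing the two losses on a single categorical next-state distribution. Everything in it is an assumption for illustration: the function names `mse_loss` and `kl_div_loss`, the three-state distributions, and the KL direction (true distribution first) are hypothetical and not taken from this repo or confirmed against the paper.

```python
import math

def mse_loss(p_true, p_model):
    """Mean squared error between two next-state distributions."""
    return sum((p - q) ** 2 for p, q in zip(p_true, p_model)) / len(p_true)

def kl_div_loss(p_true, p_model, eps=1e-12):
    """KL(p_true || p_model) for categorical distributions.

    Terms with p_true == 0 contribute nothing; eps guards against log(0)
    when the model assigns zero probability to a possible transition.
    """
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(p_true, p_model)
        if p > 0
    )

# Hypothetical distribution over 3 next states for one (state, action) pair.
p_true = [0.9, 0.1, 0.0]
p_close = [0.8, 0.2, 0.0]  # slightly off, but every possible transition has mass
p_zero = [1.0, 0.0, 0.0]   # rules out a transition that can actually occur

print("close:", mse_loss(p_true, p_close), kl_div_loss(p_true, p_close))
print("zero: ", mse_loss(p_true, p_zero), kl_div_loss(p_true, p_zero))
```

In this example both model predictions have (up to floating-point error) the same MSE, while the KL divergence is much larger for the model that assigns zero probability to a transition that can occur. That is one way the choice of objective could change what the model learns.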

YanickZengaffinen commented 2 months ago

This is what is learned on MDP2 with KLDiv. The model reward is 0.0, so the model is perfect at predicting the transitions that are taken. Avg Policy Reward on Learned Model is 0.73, which is nowhere close to the 40, but that doesn't mean much. [image]

More experiments are needed, especially on the actual reward of the policy in the real env.