nilscrm / stackelberg-ml

KLDiv instead of MSE #27

Open YanickZengaffinen opened 5 months ago

YanickZengaffinen commented 5 months ago

In the GT-MBRL paper they use KLDiv as the objective, but we are currently using MSE.

Implement KLDiv because it allows a loss of 0 even with random (stochastic) transitions. MSE does not, so the model would be incentivized to choose short trajectories, as its rewards are always negative.
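
A minimal sketch of the argument above, not taken from the repo. It assumes a tabular setting where the MSE is computed against one-hot next states sampled from the true transition distribution (how the MSE is actually computed in stackelberg-ml may differ). The names `kl_div`, `expected_mse_vs_samples`, `true_p`, and `model_q` are hypothetical.

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing by convention.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def expected_mse_vs_samples(p, q):
    """Expected MSE between prediction q and a one-hot next state sampled from p.

    Assumption: the MSE target is the sampled next state, not the true distribution.
    """
    n = len(p)
    total = 0.0
    for s, ps in enumerate(p):
        one_hot = [1.0 if i == s else 0.0 for i in range(n)]
        total += ps * sum((qi - oi) ** 2 for qi, oi in zip(q, one_hot)) / n
    return total

# Hypothetical stochastic transition for one (state, action) pair.
true_p = [0.5, 0.5, 0.0]
# A model that matches the true distribution exactly.
model_q = [0.5, 0.5, 0.0]

kl = kl_div(true_p, model_q)                     # zero for a perfect model
mse = expected_mse_vs_samples(true_p, model_q)   # strictly positive

print(f"KL:  {kl:.4f}")
print(f"MSE: {mse:.4f}")
```

Under these assumptions, a perfect model reaches zero KL divergence but still pays a positive MSE on every stochastic step. If the model's reward is the negative loss, each extra step then costs it something, which matches the short-trajectory incentive described above.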

YanickZengaffinen commented 5 months ago

This is what is learned on MDP2 with KLDiv. Model reward is 0.0, so it's perfect at predicting the transitions that are taken. Avg Policy Reward on Learned Model is 0.73, which is nowhere close to the 40, but that doesn't mean much.

More experiments are needed (especially: the actual reward of the policy in the real env).