nilscrm / stackelberg-ml


Minimal Guarantees For SE #22

Open YanickZengaffinen opened 4 months ago

YanickZengaffinen commented 4 months ago

What minimal guarantees do we need from our MDP so that the Stackelberg equilibrium (SE) cannot be arbitrarily bad? Some suggestions:

What extensions to the training algorithm could guarantee this irrespective of the MDP it is trained on? Some suggestions:

YanickZengaffinen commented 4 months ago

Model loss depending on reward of trajectories

Adding a negative reward for transitions for non-query observations
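A minimal sketch of how the first suggestion (a model loss that depends on the reward of trajectories) could look, assuming the loss is a convex combination of a model-fit term and the negated return of the policy on the true env. The functions `discounted_return` and `model_loss`, their arguments, and the blending formula are illustrative assumptions, not the repo's actual code; only the name `env_reward_weight` appears in this thread.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Sum of rewards along one trajectory, discounted by gamma per step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def model_loss(
    fit_loss: float,
    env_rewards: Sequence[float],
    env_reward_weight: float,
    gamma: float = 1.0,
) -> float:
    """Hypothetical blended loss for the model (assumed form, not from the repo).

    fit_loss:          how badly the model matches observed transitions
    env_rewards:       rewards the policy collected on the true env
    env_reward_weight: 0.0 = pure model fitting, 1.0 = pure policy reward
    """
    if not 0.0 <= env_reward_weight <= 1.0:
        raise ValueError("env_reward_weight must lie in [0, 1]")
    env_return = discounted_return(env_rewards, gamma)
    # Higher return on the true env should lower the loss, hence the minus sign.
    return (1.0 - env_reward_weight) * fit_loss - env_reward_weight * env_return
```

Under this assumed form, `env_reward_weight = 1.0` drops the fit term entirely, so the model only receives signal through the reward of the policy.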

Results:

When the model is trained only on the reward of the policy (env_reward_weight = 1.0) on simple_mdp_2_variant_2, I get a reward of 37 on the true env (11 on the learned model, but it makes sense that this is lower because there's no signal for the model to improve beyond this). Here's the final model: [image]

On simple_mdp_2, I get a similar reward of 36.6. Here's the final model: [image]

YanickZengaffinen commented 4 months ago

[image] The problem can occur in ergodic MDPs: assume the model hides s0. Then it is optimal to always play a0, so the policy will never discover that s0 even exists.
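To make the argument concrete, here is a toy sketch of the failure mode. The 3-state MDP, its rewards, and the start state are made up for illustration and are not the MDP from the image; "ergodic" is read here in the sense that every state can reach every other state under some policy. The model hides s0 by pretending that a1 is a zero-reward self-loop, so the best response to the model always plays a0 and never visits s0 in the true env.

```python
GAMMA = 0.9
STATES = [0, 1, 2]   # s0, s1, s2
ACTIONS = [0, 1]     # a0, a1

# Deterministic next-state tables, indexed as NEXT[state][action] (hypothetical MDP).
TRUE_NEXT = {0: {0: 1, 1: 0}, 1: {0: 2, 1: 0}, 2: {0: 1, 1: 0}}
# The model hides s0: from s1 and s2, a1 is modelled as a self-loop instead of leading to s0.
MODEL_NEXT = {0: {0: 1, 1: 0}, 1: {0: 2, 1: 1}, 2: {0: 1, 1: 2}}
# Rewards REWARD[state][action]; staying in s0 via a1 is by far the best option.
REWARD = {0: {0: 0.0, 1: 10.0}, 1: {0: 1.0, 1: 0.0}, 2: {0: 1.0, 1: 0.0}}


def greedy_policy(next_state, reward, gamma=GAMMA, iters=500):
    """Value iteration followed by greedy action selection."""
    v = {s: 0.0 for s in STATES}
    for _ in range(iters):
        v = {
            s: max(reward[s][a] + gamma * v[next_state[s][a]] for a in ACTIONS)
            for s in STATES
        }
    return {
        s: max(ACTIONS, key=lambda a: reward[s][a] + gamma * v[next_state[s][a]])
        for s in STATES
    }


def rollout(policy, next_state, reward, start=1, steps=50):
    """Run the policy from `start`; return visited states and undiscounted reward."""
    s, visited, total = start, [start], 0.0
    for _ in range(steps):
        a = policy[s]
        total += reward[s][a]
        s = next_state[s][a]
        visited.append(s)
    return visited, total


# Best response to the (wrong) model, evaluated on the true env.
model_policy = greedy_policy(MODEL_NEXT, REWARD)
visited, model_policy_reward = rollout(model_policy, TRUE_NEXT, REWARD)

# Best response to the true env, for comparison.
true_policy = greedy_policy(TRUE_NEXT, REWARD)
_, true_policy_reward = rollout(true_policy, TRUE_NEXT, REWARD)

# On every (state, action) pair the policy actually tries, the model is exactly right,
# so on-policy data gives no signal that s0 is missing.
model_consistent_on_policy = all(
    MODEL_NEXT[s][model_policy[s]] == TRUE_NEXT[s][model_policy[s]]
    for s in set(visited)
)

print("visited states:", sorted(set(visited)))
print("reward with model's best response:", model_policy_reward)
print("reward with true best response:", true_policy_reward)
```

In this toy setup the model is never contradicted by the data the policy collects, which is why fitting the model on on-policy transitions alone cannot repair it.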