YanickZengaffinen opened 4 months ago
When trained only on the reward of the policy (env_reward_weight = 1.0) on simple_mdp_2_variant_2, I get a reward of 37 on the true environment (11 on the learned model, but it makes sense that this is lower because there's no signal for the model to improve beyond this). Here's the final model:
On simple_mdp_2, I also get 36.6 reward. Here's the final model:
The problem can even occur in ergodic MDPs: suppose the learned model hides s0; then it is optimal (under the model) to always play a0, so the agent will never discover that s0 even exists.
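To make that failure mode concrete, here is a minimal toy sketch (all names here are hypothetical, not from the repo): the learned model has never observed s0, so it predicts that both actions stay in s1. The policy that is greedy w.r.t. that model then always plays a0 in the true environment and never generates the experience that would correct the model.

```python
# True dynamics (hypothetical toy MDP): from s1, action a1 leads to the
# high-reward state s0; a0 stays in s1 for a small reward.
TRUE_MDP = {
    ("s1", "a0"): ("s1", 1.0),
    ("s1", "a1"): ("s0", 0.0),
    ("s0", "a0"): ("s0", 10.0),
    ("s0", "a1"): ("s1", 0.0),
}

# Learned model that has never seen s0: it predicts both actions stay in s1,
# i.e. it "hides" s0 exactly as described above.
LEARNED_MODEL = {
    ("s1", "a0"): ("s1", 1.0),
    ("s1", "a1"): ("s1", 0.0),
}

def greedy_action(model, state):
    """One-step greedy action w.r.t. the model's predicted reward."""
    return max(("a0", "a1"),
               key=lambda a: model.get((state, a), (state, float("-inf")))[1])

# Execute the model-greedy policy in the TRUE environment: it always plays a0,
# never reaches s0, and so never collects data that could fix the model.
state, visited, total = "s1", set(), 0.0
for _ in range(20):
    visited.add(state)
    action = greedy_action(LEARNED_MODEL, state)
    state, reward = TRUE_MDP[(state, action)]
    total += reward

print(visited, total)  # → {'s1'} 20.0
```

Under the true dynamics, playing a1 once and then a0 for the remaining 19 steps would yield 190 reward, so the simulation-induced policy is arbitrarily worse than optimal while looking optimal inside the model.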
What are the minimum guarantees we need from our MDP such that the SE cannot be arbitrarily bad? Some suggestions:
What extensions to the training algorithm might guarantee this irrespective of the MDP it is trained on? Some suggestions: