YanickZengaffinen opened 4 months ago
When trained only on the reward of the policy (env_reward_weight = 1.0) on simple_mdp_2_variant_2, I get a reward of 37 on the true environment (11 on the learned model, but it makes sense that this is lower because there's no signal for the model to improve beyond this). Here's the final model:
On simple_mdp_2, I also get 36.6 reward. Here's the final model:
The problem can even occur in ergodic MDPs: suppose the learned model hides s0; then it is optimal (under the model) to always play a0, so the agent will never discover that s0 even exists.
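To make that failure mode concrete, here is a minimal toy sketch (all names here are hypothetical, not from the repo): the learned model has never observed s0, so it predicts that both actions stay in s1. The policy that is greedy w.r.t. that model then always plays a0 in the true environment and never generates the experience that would correct the model.

```python
# True dynamics (hypothetical toy MDP): from s1, action a1 leads to the
# high-reward state s0; a0 stays in s1 for a small reward.
TRUE_MDP = {
    ("s1", "a0"): ("s1", 1.0),
    ("s1", "a1"): ("s0", 0.0),
    ("s0", "a0"): ("s0", 10.0),
    ("s0", "a1"): ("s1", 0.0),
}

# Learned model that has never seen s0: it predicts both actions stay in s1,
# i.e. it "hides" s0 exactly as described above.
LEARNED_MODEL = {
    ("s1", "a0"): ("s1", 1.0),
    ("s1", "a1"): ("s1", 0.0),
}

def greedy_action(model, state):
    """One-step greedy action w.r.t. the model's predicted reward."""
    return max(("a0", "a1"),
               key=lambda a: model.get((state, a), (state, float("-inf")))[1])

# Execute the model-greedy policy in the TRUE environment: it always plays a0,
# never reaches s0, and so never collects data that could fix the model.
state, visited, total = "s1", set(), 0.0
for _ in range(20):
    visited.add(state)
    action = greedy_action(LEARNED_MODEL, state)
    state, reward = TRUE_MDP[(state, action)]
    total += reward

print(visited, total)  # → {'s1'} 20.0
```

Under the true dynamics, playing a1 once and then a0 for the remaining 19 steps would yield 190 reward, so the simulation-induced policy is arbitrarily worse than optimal while looking optimal inside the model.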
What are the minimum guarantees we need from our MDP such that the SE cannot be arbitrarily bad? Some suggestions:
What extensions to the training algorithm might guarantee this irrespective of the MDP it is trained on? Some suggestions: