YanickZengaffinen opened 1 month ago
I tried to keep this one as simple as possible (initial_state: 0, final_state: None). Here MAL actually achieves a reward of 243.2, which is pretty close to the maximum of 246. The MDP is probably too simple / the chance of visiting s1 is too high. For reference, this is the final model that is learned:
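For context, here is a minimal sketch of what a tabular spec for such a tiny 2-state MDP could look like (this is not the learned model from the figure above; the array names and all numbers are purely illustrative placeholders):

```python
import numpy as np

# Hypothetical tabular spec of a tiny 2-state, 2-action MDP.
# transitions[s, a, s'] = P(s' | s, a); rewards[s, a] = immediate reward.
# The actual probabilities/rewards of the MDP above live in the repo; these
# numbers only show the shape of the data.
transitions = np.array([
    [[0.9, 0.1],   # s0, a0
     [0.1, 0.9]],  # s0, a1
    [[0.5, 0.5],   # s1, a0
     [0.8, 0.2]],  # s1, a1
])
rewards = np.array([
    [1.0, 0.0],  # rewards for (s0, a0), (s0, a1)
    [0.0, 2.0],  # rewards for (s1, a0), (s1, a1)
])

initial_state = 0   # matches initial_state: 0 above
final_state = None  # matches final_state: None (no terminal state)

assert np.allclose(transitions.sum(axis=-1), 1.0)  # each row must be a distribution
```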
Here I tried to avoid self-loops (initial_state: 0, final_state: None). On this MDP, MAL achieved a reward of 223 and, as you can see, it actually discovered the best loop:
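A quick sanity check for the "no self-loops" property, assuming the same hypothetical `transitions[s, a, s']` layout as the sketch above, might look like this:

```python
import numpy as np

def has_self_loop(transitions: np.ndarray, tol: float = 0.0) -> bool:
    """Return True if any action can keep the agent in its current state.

    transitions[s, a, s'] is the transition probability table; with tol=0.0
    even tiny self-transition probabilities are flagged.
    """
    n_states = transitions.shape[0]
    # diag[s, a] = P(s | s, a), i.e. the probability of staying in state s
    diag = transitions[np.arange(n_states), :, np.arange(n_states)]
    return bool((diag > tol).any())
```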
Even on this 4-state MDP (initial_state: 0, final_state: None), MAL is learning something (it achieves a reward of 267 out of 485) with the following model:
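As a side note, for MDPs this small the maximum achievable expected reward can be computed directly with finite-horizon value iteration. A sketch, again assuming the hypothetical `transitions`/`rewards` tables from above and that the 485 corresponds to an undiscounted return over a fixed horizon (which may not match how that number was actually obtained):

```python
import numpy as np

def max_return(transitions: np.ndarray, rewards: np.ndarray,
               initial_state: int, horizon: int) -> float:
    """Maximum expected undiscounted return over `horizon` steps.

    transitions[s, a, s'] and rewards[s, a] are tabular; this is standard
    finite-horizon value iteration (backward induction).
    """
    n_states = transitions.shape[0]
    values = np.zeros(n_states)
    for _ in range(horizon):
        # Q[s, a] = r(s, a) + sum_{s'} P(s' | s, a) * V(s')
        q = rewards + transitions @ values
        values = q.max(axis=1)
    return float(values[initial_state])
```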
Here is a run on an MDP that is only ergodic but not deterministic (initial_state: 1, final_state: None). MAL achieves a reward of 9.8 and the final model is:
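"Ergodic but not deterministic" is meant in the sense that every state stays reachable from every other state, while individual transitions are stochastic. Two crude checks on the hypothetical `transitions[s, a, s']` table could look like this (the second check only verifies that all states communicate under uniformly random actions, which is a necessary condition for ergodicity, not the full property):

```python
import numpy as np

def is_deterministic(transitions: np.ndarray) -> bool:
    """True if every (state, action) pair leads to exactly one next state."""
    return bool(np.isclose(transitions.max(axis=-1), 1.0).all())

def all_states_communicate(transitions: np.ndarray) -> bool:
    """True if every state is reachable from every other state when actions
    are chosen uniformly at random (necessary for the MDP to be ergodic)."""
    p = transitions.mean(axis=1)                  # state-to-state matrix under the uniform policy
    reach = (p > 0).astype(float)                 # adjacency matrix of positive-probability edges
    n = p.shape[0]
    closure = np.linalg.matrix_power(reach + np.eye(n), n) > 0  # transitive closure
    return bool(closure.all())
```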
Branch: https://github.com/nilscrm/stackelberg-ml/tree/more-mdps