nilscrm / stackelberg-ml


How MBRL Approaches Fit Into Gerstgrasser Framework #8

Open YanickZengaffinen opened 6 months ago

YanickZengaffinen commented 6 months ago

Could be interesting to discuss whether the Game Theoretic MBRL approach fits into the Gerstgrasser Framework (it has an inner/outer loop structure, but we would have to look more closely at how querying is done).

We could potentially also take a quick look at other SOTA approaches and whether they fit into the Gerstgrasser Framework.

nilscrm commented 5 months ago

Actually, I changed my mind about this, and you might be correct with your hunch that the Game Theoretic MBRL approach might not fit into the Gerstgrasser Framework. I think all assumptions of Lemma 1 of the Gerstgrasser paper are satisfied, and thus the optimum of the learning problem is actually the Stackelberg equilibrium (that is what I was always referring to). However, even though the follower is implemented as a query oracle (I think you can see RL as one type of oracle implementation), we don't show these queries (the learning process of the follower) to the leader. The paper states:

If the follower oracle is implemented using RL, i.e., both leader and followers use RL, then the initial segment is simply one or more episodes of M where the followers are learning, and the final segment is one episode from M where the followers have converged.

Note the initial segment of episodes of M where the followers are learning. That means we also need to show the leader the trajectories we use to train the follower. So one trajectory of the leader consists of many trajectories of the follower (with no rewards) followed by one trajectory with rewards (see the sketch below). This is not done by the implementation of the MBRL paper.
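To make that structure concrete, here is a minimal sketch of how a single leader trajectory would be assembled under the Gerstgrasser Framework. Everything here is assumed for illustration only: the `rollout` and `follower_update` callables and the transition format are hypothetical, not this repo's or the paper's API.

```python
from typing import Callable, List, Tuple

# (state, action, reward, next_state) -- deliberately generic transition type
Transition = Tuple[object, object, float, object]

def gerstgrasser_leader_episode(
    rollout: Callable[[], List[Transition]],            # assumed: one follower episode vs. the current leader
    follower_update: Callable[[List[Transition]], None],  # assumed: one RL update of the follower
    n_learning_episodes: int,
) -> List[Transition]:
    """One leader 'episode': the follower's learning episodes (rewards masked)
    followed by one episode with the (approximately) converged follower."""
    leader_trajectory: List[Transition] = []

    # Initial segment: the follower is still learning. The leader observes
    # these transitions but receives no reward signal from them.
    for _ in range(n_learning_episodes):
        episode = rollout()
        follower_update(episode)
        leader_trajectory += [(s, a, 0.0, s2) for (s, a, _, s2) in episode]

    # Final segment: the follower has converged to a best response;
    # this episode carries the leader's actual reward.
    leader_trajectory += rollout()
    return leader_trajectory
```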

Since we don't include the training process, we effectively have an immediately-best-responding follower, which can diverge.
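For contrast, a sketch of the immediately-best-responding setup described above (again with the same assumed, hypothetical helpers): the follower is trained to approximate convergence out of the leader's sight, and the leader only ever sees the final episode.

```python
def immediate_best_response_leader_episode(rollout, follower_update, n_learning_episodes):
    """Leader 'episode' when the follower is treated as immediately best-responding:
    the follower's training happens out of the leader's sight."""
    # The follower trains against the fixed leader policy, but none of these
    # learning transitions are ever shown to the leader.
    for _ in range(n_learning_episodes):
        follower_update(rollout())

    # The leader only observes (and learns from) the final, converged episode.
    return rollout()
```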