nilscrm / stackelberg-ml


More Sample Efficient MAL #41

Open YanickZengaffinen opened 1 month ago

YanickZengaffinen commented 1 month ago

Our improved MAL solutions currently take lots of samples. Implementing model learning as RL with a complex NN describing the agent's "policy" might not be necessary when essentially all we want is to find a query-/context-vector that is close to the actual environment and yields a policy that performs well in the real env.

Because our S and A are discrete, we could simply keep a model of the environment that is continually extended (i.e. S x A -> S is just the mean of all transitions we observed, similar to how it's done in Dyna-Q [see comparison to literature]), which might be more sample efficient.
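A minimal sketch of such a tabular model, assuming discrete state/action indices (class and method names are placeholders, not from the codebase; unvisited (s, a) pairs fall back to a uniform distribution here):

```python
import numpy as np

class TabularTransitionModel:
    """Empirical S x A -> S model: the mean of all observed transitions,
    with a uniform fallback for state-action pairs we have never visited."""

    def __init__(self, num_states: int, num_actions: int, rng=None):
        self.num_states = num_states
        self.counts = np.zeros((num_states, num_actions, num_states))
        self.rng = rng or np.random.default_rng()

    def update(self, state: int, action: int, next_state: int) -> None:
        """Extend the model with one observed real-env transition."""
        self.counts[state, action, next_state] += 1

    def transition_probs(self, state: int, action: int) -> np.ndarray:
        """Empirical next-state distribution; uniform where we have no samples yet."""
        c = self.counts[state, action]
        total = c.sum()
        if total == 0:
            return np.full(self.num_states, 1.0 / self.num_states)
        return c / total

    def sample_next_state(self, state: int, action: int) -> int:
        """Draw s' ~ p(.|s, a) from the empirical model."""
        probs = self.transition_probs(state, action)
        return int(self.rng.choice(self.num_states, p=probs))
```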

Other approaches are also feasible.

It's always worth paying attention to which properties of Gerstgrasser we'd give up (e.g. the requirement that the leader sees the query responses but gets 0 reward for them would most likely be violated => no guarantee of converging to an SE anymore).

Angramme commented 1 month ago

Yanick's comments

Yeah, I think it could be more sample efficient (because right now we are learning an NN over S x A -> S, just to then get good query answers / learn a good policy in S -> A). We actually started off with an implementation similar to what you are proposing (e.g. check the old train file, like here https://github.com/nilscrm/stackelberg-ml/blob/kldiv/stackelberg_mbrl/train.py ). We abandoned it because it wasn't learning, which we attributed to us not showing the leader its query answers properly, but we now know it's because the model can cheat. I do think we can profit in terms of sample efficiency; heck, we probably don't even need an NN as the model but can just return the average over what we sampled so far, and return random values for transitions we haven't observed yet. BUT by doing so we are no longer following Gerstgrasser, so we lose the guarantee that we converge to an SE.

We could then also take a more principled approach: after updating the model, sample in the parts of the real env where we still have high uncertainty (i.e. where the new best-response policy leads us but no policy has led us before). Fixing the cheating issue could be done similarly to MAL + Noise (i.e. we sample a few random transitions from the real env); see the sketch below.
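A hedged sketch of what that sampling step could look like. The env interface (`reset()`, `step()`, a one-step `query()` helper) and the count-based novelty weighting are assumptions for illustration, not the repo's API:

```python
import numpy as np

def collect_directed_samples(real_env, best_response_policy, visit_counts,
                             num_rollouts=5, num_noise_probes=10, rng=None):
    """Roll out the current best-response policy in the real env (this naturally
    visits the regions the new policy leads us to, i.e. where the model is most
    uncertain), and additionally probe a few random (s, a) pairs, analogous to
    the 'MAL + Noise' idea, so the model cannot cheat on unvisited pairs.

    Assumes real_env exposes reset() -> s, step(a) -> (s', r, done) and
    query(s, a) -> s'; adapt to the actual env API."""
    rng = rng or np.random.default_rng()
    samples = []

    # 1) On-policy rollouts: where the new best response leads us.
    for _ in range(num_rollouts):
        state = real_env.reset()
        done = False
        while not done:
            action = best_response_policy(state)
            next_state, reward, done = real_env.step(action)
            samples.append((state, action, next_state))
            visit_counts[state, action] += 1
            state = next_state

    # 2) Random probes, biased towards rarely visited (s, a) pairs.
    novelty = 1.0 / (1.0 + visit_counts.flatten())
    probs = novelty / novelty.sum()
    for idx in rng.choice(visit_counts.size, size=num_noise_probes, p=probs):
        s, a = np.unravel_index(idx, visit_counts.shape)
        samples.append((int(s), int(a), real_env.query(int(s), int(a))))
        visit_counts[s, a] += 1

    return samples
```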

Angramme commented 1 month ago

TODO try:

YanickZengaffinen commented 1 month ago

For completeness, here's pseudocode of what I'm proposing.

// follower pretraining
pretrain policy for all possible models (as we do rn)  // 0x samples from real env until this point

// leader training
initial_models = draw a few random models
initial_policies = get the best-response policies for the initial_models
samples = rollout the initial_policies
model = mean(samples, axis=0) in R^{(SA)xS}  // random if no samples are available
old_model = -inf in R^{(SA)xS}
while |model - old_model| > eps:  // while the model is still changing (maybe find better criteria)
    samples.extend(rollout the best response policy to the current model)
    old_model = model
    model = mean(samples, axis=0)
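A minimal runnable sketch of that leader-training loop in Python/numpy, assuming helpers `best_response(model)` (the pretrained follower's best-response oracle) and `rollout(policy)` (returns a list of (s, a, s') transitions from the real env) exist; both names are placeholders, not functions from the repo:

```python
import numpy as np

def leader_training(best_response, rollout, num_states, num_actions,
                    num_initial_models=3, eps=1e-3, rng=None):
    """Sketch of the proposed loop: keep an empirical (S*A) x S transition
    model as the running mean of observed transitions and iterate until it
    stops changing (crude convergence criterion, as noted in the pseudocode)."""
    rng = rng or np.random.default_rng()
    counts = np.zeros((num_states * num_actions, num_states))

    def empirical_model():
        # Row-wise mean of observed next states; uniform where we have no samples.
        totals = counts.sum(axis=1, keepdims=True)
        model = np.full_like(counts, 1.0 / num_states)
        np.divide(counts, totals, out=model, where=totals > 0)
        return model

    def add_samples(transitions):
        for s, a, s_next in transitions:
            counts[s * num_actions + a, s_next] += 1

    # Bootstrap with rollouts of best responses to a few random models.
    for _ in range(num_initial_models):
        random_model = rng.dirichlet(np.ones(num_states), size=num_states * num_actions)
        add_samples(rollout(best_response(random_model)))

    model = empirical_model()
    old_model = np.full_like(model, np.inf)
    # Iterate while the empirical model is still changing.
    while np.abs(model - old_model).max() > eps:
        add_samples(rollout(best_response(model)))
        old_model = model
        model = empirical_model()
    return model
```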

FAQ: