nilscrm / stackelberg-ml


More Sample Efficient MAL #41

Open YanickZengaffinen opened 1 month ago

YanickZengaffinen commented 1 month ago

Our improved MAL solutions currently take lots of samples. Implementing model learning as RL with a complex NN describing the agent's "policy" might not be necessary when essentially all we want is to find a query-/context-vector that is close to the actual environment and yields a policy that performs well in the real env.

Because our S and A are discrete, we could simply keep a model of the environment that is continually extended (i.e. S x A -> S is just the mean of all transitions we observed, similar to how it's done in Dyna-Q [see comparison to literature]), which might be more sample efficient.
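A minimal sketch of such a tabular model, assuming discrete state/action indices (class and method names are placeholders, not from the codebase; unvisited (s, a) pairs fall back to a uniform distribution here):

```python
import numpy as np

class TabularTransitionModel:
    """Empirical S x A -> S model: the mean of all observed transitions,
    with a uniform fallback for state-action pairs we have never visited."""

    def __init__(self, num_states: int, num_actions: int, rng=None):
        self.num_states = num_states
        self.counts = np.zeros((num_states, num_actions, num_states))
        self.rng = rng or np.random.default_rng()

    def update(self, state: int, action: int, next_state: int) -> None:
        """Extend the model with one observed real-env transition."""
        self.counts[state, action, next_state] += 1

    def transition_probs(self, state: int, action: int) -> np.ndarray:
        """Empirical next-state distribution; uniform where we have no samples yet."""
        c = self.counts[state, action]
        total = c.sum()
        if total == 0:
            return np.full(self.num_states, 1.0 / self.num_states)
        return c / total

    def sample_next_state(self, state: int, action: int) -> int:
        """Draw s' ~ p(.|s, a) from the empirical model."""
        probs = self.transition_probs(state, action)
        return int(self.rng.choice(self.num_states, p=probs))
```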

Other approaches are also feasible.

It's always worth paying attention to which properties of Gerstgrasser we'd give up (e.g. the requirement that the leader sees the query responses but gets 0 reward for them would most likely be violated => no guarantee of converging to an SE anymore).

Angramme commented 1 month ago

Yanick's comments

Yeah, I think it could be more sample efficient (because right now we are learning an NN over S x A -> S, just to then get good query answers / learn a good policy in S -> A). We actually started off with an implementation similar to what you are proposing (e.g. check the old train file, like here https://github.com/nilscrm/stackelberg-ml/blob/kldiv/stackelberg_mbrl/train.py ). We abandoned it because it wasn't learning, which we attributed to us not showing the leader its query answers properly, but we now know it's because the model can cheat. I do think we can profit in terms of sample efficiency; heck, we probably don't even need an NN as the model but can just return the average over what we sampled so far, and return random values for transitions we haven't observed yet. BUT by doing so we are no longer following Gerstgrasser, so we lose the guarantee that we converge to an SE.

We could then also take a more principled approach: after updating the model, sample in the parts of the real env where we still have high uncertainty (i.e. where the new best-response policy leads us but no policy has led us before). Fixing the cheating issue could be done similarly to MAL + Noise (i.e. we sample a few random transitions from the real env); see the sketch below.
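A hedged sketch of what that sampling step could look like. The env interface (`reset()`, `step()`, a one-step `query()` helper) and the count-based novelty weighting are assumptions for illustration, not the repo's API:

```python
import numpy as np

def collect_directed_samples(real_env, best_response_policy, visit_counts,
                             num_rollouts=5, num_noise_probes=10, rng=None):
    """Roll out the current best-response policy in the real env (this naturally
    visits the regions the new policy leads us to, i.e. where the model is most
    uncertain), and additionally probe a few random (s, a) pairs, analogous to
    the 'MAL + Noise' idea, so the model cannot cheat on unvisited pairs.

    Assumes real_env exposes reset() -> s, step(a) -> (s', r, done) and
    query(s, a) -> s'; adapt to the actual env API."""
    rng = rng or np.random.default_rng()
    samples = []

    # 1) On-policy rollouts: where the new best response leads us.
    for _ in range(num_rollouts):
        state = real_env.reset()
        done = False
        while not done:
            action = best_response_policy(state)
            next_state, reward, done = real_env.step(action)
            samples.append((state, action, next_state))
            visit_counts[state, action] += 1
            state = next_state

    # 2) Random probes, biased towards rarely visited (s, a) pairs.
    novelty = 1.0 / (1.0 + visit_counts.flatten())
    probs = novelty / novelty.sum()
    for idx in rng.choice(visit_counts.size, size=num_noise_probes, p=probs):
        s, a = np.unravel_index(idx, visit_counts.shape)
        samples.append((int(s), int(a), real_env.query(int(s), int(a))))
        visit_counts[s, a] += 1

    return samples
```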

Angramme commented 1 month ago

TODO try:

YanickZengaffinen commented 1 month ago

For completeness, here's pseudocode of what I'm proposing.

// follower pretraining
pretrain policy for all possible models (as we do rn)  // 0x samples from real env until this point

// leader training
initial_models = draw a few random models
initial_policies = get the best-response policies for the initial_models
samples = rollout the initial_policies
model = mean(samples, axis=0) in R^{(SA)xS}  // random if no samples are available
old_model = -inf in R^{(SA)xS}
while |model - old_model| > eps:  // while the model is still changing (maybe find better criteria)
    samples.extend(rollout the best response policy to the current model)
    old_model = model
    model = mean(samples, axis=0)
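A minimal runnable sketch of that leader-training loop in Python/numpy, assuming helpers `best_response(model)` (the pretrained follower's best-response oracle) and `rollout(policy)` (returns a list of (s, a, s') transitions from the real env) exist; both names are placeholders, not functions from the repo:

```python
import numpy as np

def leader_training(best_response, rollout, num_states, num_actions,
                    num_initial_models=3, eps=1e-3, rng=None):
    """Sketch of the proposed loop: keep an empirical (S*A) x S transition
    model as the running mean of observed transitions and iterate until it
    stops changing (crude convergence criterion, as noted in the pseudocode)."""
    rng = rng or np.random.default_rng()
    counts = np.zeros((num_states * num_actions, num_states))

    def empirical_model():
        # Row-wise mean of observed next states; uniform where we have no samples.
        totals = counts.sum(axis=1, keepdims=True)
        model = np.full_like(counts, 1.0 / num_states)
        np.divide(counts, totals, out=model, where=totals > 0)
        return model

    def add_samples(transitions):
        for s, a, s_next in transitions:
            counts[s * num_actions + a, s_next] += 1

    # Bootstrap with rollouts of best responses to a few random models.
    for _ in range(num_initial_models):
        random_model = rng.dirichlet(np.ones(num_states), size=num_states * num_actions)
        add_samples(rollout(best_response(random_model)))

    model = empirical_model()
    old_model = np.full_like(model, np.inf)
    # Iterate while the empirical model is still changing.
    while np.abs(model - old_model).max() > eps:
        add_samples(rollout(best_response(model)))
        old_model = model
        model = empirical_model()
    return model
```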

FAQ: