Tried it on the easiest case, simple_mdp_2_variant_1, and got reward 29.320 ± 10.557. Here is the true env and here is the learned mdp.
On the hardest case, simple_mdp_2, it does nothing.
Details: if temperature >= random(), then proceed as before; else choose a random action uniformly among the num_actions available actions.
The temperature scales with the total number of env steps; I settled on this function in the end:
lambda step: max(0.01, np.exp(-(step / model_config.total_training_steps) * -np.log(0.005)))
Check out inject_random_samples for the code.
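For reference, here is a minimal sketch of what that rule plus the schedule above amounts to; the function names are illustrative, not the actual inject_random_samples implementation:

```python
import numpy as np

rng = np.random.default_rng()

def temperature(step, total_training_steps):
    # Schedule from above: decays from 1.0 at step 0 towards 0.005 at the end
    # of training, clamped at a floor of 0.01.
    return max(0.01, np.exp(-(step / total_training_steps) * -np.log(0.005)))

def maybe_randomize_action(policy_action, step, total_training_steps, num_actions):
    # If temperature >= random(), proceed as before (keep the policy's action);
    # otherwise pick an action uniformly among the num_actions available actions.
    if temperature(step, total_training_steps) >= rng.random():
        return policy_action
    return int(rng.integers(num_actions))
```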
With the updated sampling strategy it now works better:
For simple_mdp_2_variant_2:
For simple_mdp_2:
Decay does not work at all... The partially random variant does not work either...
During training of the leader-model, randomly injecting samples that aren't sampled under the current best-responding follower-policy would force the leader-model to also be reasonably accurate in the rest of the environment. In the beginning, the number of injected samples should be high, so the leader-model cannot "cheat" by hiding part of the true world from the follower-policy; later on, the number of random samples should gradually decrease.
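A rough sketch of that idea, assuming leader-model batches are assembled per training step and that uniformly sampled random transitions are available to mix in; the function names and the linear decay of the injection probability are assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng()

def injection_probability(step, total_training_steps, start=1.0, end=0.05):
    # Assumed linear decay: many injected random samples early on, few at the end.
    frac = min(step / total_training_steps, 1.0)
    return start + frac * (end - start)

def build_leader_batch(follower_samples, random_samples, step, total_training_steps):
    # Replace each follower-policy sample with a random sample with probability p,
    # so the leader-model is also trained on parts of the environment the
    # follower-policy never visits and cannot hide those parts from it.
    p = injection_probability(step, total_training_steps)
    return [r if rng.random() < p else f
            for f, r in zip(follower_samples, random_samples)]
```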