nilscrm / stackelberg-ml


Inject Random Samples #19

Closed YanickZengaffinen closed 5 months ago

YanickZengaffinen commented 6 months ago

During training of the leader model, if we randomly inject samples (ones that aren't sampled under the current best-responding follower policy), the leader model is forced to also be reasonably accurate in the rest of the environment. In the beginning, the number of injected samples should be high, to prevent the leader model from "cheating" by hiding part of the true world from the follower policy; later on, the number of random samples can gradually be decreased.

Angramme commented 5 months ago

Tried it on the easiest case, simple_mdp_2_variant_1, and got reward 29.320 ± 10.557. Here is the true env: image, and here is the learned MDP: image

On the hardest case, simple_mdp_2, it does nothing.

Details: if `temperature >= random()`, proceed as before; otherwise choose an action uniformly at random among `num_actions`.
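A minimal sketch of that rule (the helper name, the `policy_action` argument, and the explicit `rng` are my own plumbing, not from the branch):

```python
import random

def select_action(policy_action: int, temperature: float,
                  num_actions: int, rng: random.Random) -> int:
    """Mirror the rule above: if temperature >= random(), proceed as
    before (keep the follower-policy action); otherwise pick an action
    uniformly at random among num_actions."""
    if temperature >= rng.random():
        return policy_action          # proceed as before
    return rng.randrange(num_actions)  # uniform random action
```

With `temperature == 1.0` the policy action is always kept (since `random()` is in `[0, 1)`), and as the temperature drops the uniform branch is taken more often.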

temperature scales with the total number of env steps; I settled on this function in the end:

```python
lambda step: max(0.01, np.exp(-(step / model_config.total_training_steps) * -np.log(0.005)))
```

Check out the `inject_random_samples` branch for the code.

Angramme commented 5 months ago

With the updated sampling strategy it now works better:

for simple_mdp_2_variant_2: image

for simple_mdp_2: image

Temperature decay does not work at all... partially random sampling does not work either...