Tried it on the easiest case, simple_mdp_2_variant_1, and got reward 29.320 ± 10.557. Here is the true env and here is the learned mdp.
On the hardest case, simple_mdp_2, it does nothing.
Details: if temperature >= random(), then proceed as before; else choose a random action uniformly among the num_actions available actions.
The temperature scales with the total number of env steps; I settled on this function in the end:
lambda step: max(0.01, np.exp(-(step / model_config.total_training_steps) * -np.log(0.005)))
Check out inject_random_samples for the code.
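For reference, here is a minimal sketch of what that rule plus the schedule above amounts to; the function names are illustrative, not the actual inject_random_samples implementation:

```python
import numpy as np

rng = np.random.default_rng()

def temperature(step, total_training_steps):
    # Schedule from above: decays from 1.0 at step 0 towards 0.005 at the end
    # of training, clamped at a floor of 0.01.
    return max(0.01, np.exp(-(step / total_training_steps) * -np.log(0.005)))

def maybe_randomize_action(policy_action, step, total_training_steps, num_actions):
    # If temperature >= random(), proceed as before (keep the policy's action);
    # otherwise pick an action uniformly among the num_actions available actions.
    if temperature(step, total_training_steps) >= rng.random():
        return policy_action
    return int(rng.integers(num_actions))
```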
With the updated sampling strategy it now works better:
For simple_mdp_2_variant_2:
For simple_mdp_2:
Decay does not work at all... The partially random variant does not work either...
During training of the leader-model, randomly injecting samples that aren't sampled under the current best-responding follower-policy would force the leader-model to also be reasonably accurate in the rest of the environment. In the beginning, the number of injected samples should be high, so the leader-model cannot "cheat" by hiding part of the true world from the follower-policy; later on, the number of random samples should gradually decrease.
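A rough sketch of that idea, assuming leader-model batches are assembled per training step and that uniformly sampled random transitions are available to mix in; the function names and the linear decay of the injection probability are assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng()

def injection_probability(step, total_training_steps, start=1.0, end=0.05):
    # Assumed linear decay: many injected random samples early on, few at the end.
    frac = min(step / total_training_steps, 1.0)
    return start + frac * (end - start)

def build_leader_batch(follower_samples, random_samples, step, total_training_steps):
    # Replace each follower-policy sample with a random sample with probability p,
    # so the leader-model is also trained on parts of the environment the
    # follower-policy never visits and cannot hide those parts from it.
    p = injection_probability(step, total_training_steps)
    return [r if rng.random() < p else f
            for f, r in zip(follower_samples, random_samples)]
```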