Open · ghost opened this issue 3 years ago

Hi, I want to reuse your MiniGrid experiment as a benchmark for my paper on RL generalisation. It fits nicely, but I'm not clear on how to replicate the experiment that produces the orange line in your paper. Can you provide some insight? Are you running the training on 2,000,000 environments to generate the chart? Thanks a lot in advance.
Just to be more precise: I would like to train your agent on 1000 random environments and test it on 1000 other environments, to get the generalisation percentage on those test environments. I'm not sure how I can do that with the code provided. Thanks!
Hi, thanks for your interest! We only have an explicit train/test split for the Coinrun environment. For MiniGrid, we randomly sample from all possible layouts during training. This doesn't let us explicitly measure the generalisation gap, but the agents' performance (and learning speed) still correlates with how well they generalise, since the number of possible layouts is so large that they rarely see the same layout twice. So Figure 2 just shows the normal training performance we usually report in RL. Note that there's a lot of variation in the results, which is why ours are averaged over 30 random seeds.
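If you do want an explicit split on MiniGrid, here's a minimal sketch of one way to build it (this is not part of our released code; it assumes the old gym / gym-minigrid API, where seeding an environment fixes the layout drawn by the next reset(), and the env id and policy below are just placeholders):

```python
import random

import gym
import gym_minigrid  # noqa: F401  (registers the MiniGrid env ids)

# Draw 2000 distinct seeds and split them: each seed determines one layout.
rng = random.Random(0)
seeds = rng.sample(range(1_000_000), 2000)
train_seeds, test_seeds = seeds[:1000], seeds[1000:]

env = gym.make("MiniGrid-MultiRoom-N6-v0")  # placeholder env id

def run_episode(env, seed, policy):
    """Roll out one episode on the layout determined by `seed`."""
    env.seed(seed)  # fixes the layout generated by the next reset()
    obs = env.reset()
    done, episode_return = False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        episode_return += reward
    return episode_return

def policy(obs):
    # Stand-in for the trained PPO agent.
    return env.action_space.sample()

# Train only on layouts from train_seeds, then estimate generalisation
# as the average return over the held-out test layouts:
test_return = sum(run_episode(env, s, policy) for s in test_seeds) / len(test_seeds)
```

The generalisation gap would then just be the difference between the average returns on train_seeds and test_seeds.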
Sure, so if I understood it well: you iterate, training on 3 randomly chosen environments and then testing on another one, also randomly chosen, right? And the results are computed every 30 tests as the average reward over those 30 test environments?
For MiniGrid we're using the usual PPO setup (see here for hyperparameters; the number of parallel environments is set by --procs). Each environment is run entirely independently from the others, i.e. when we reach the end of the episode in one environment, we randomly sample a new layout in that environment and continue rollouts there. The layout sampling is not restricted in any way, so eventually we should have seen every possible layout during training. Generalisation only matters because there are so many layouts; I'm not sure exactly how many, but your estimate of 2M could be correct, though maybe it's a bit less. Perhaps you're confusing the N3r suffix in the environment name with 'train on 3 environments': N3r only means that layouts are randomly generated with up to 3 rooms. Not sure if that helps; please let me know if not. I feel like we might be talking past each other :).
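To make the rollout scheme concrete, here's a rough sketch (not our actual training code; the env id and the num_procs value are illustrative stand-ins for what --procs configures):

```python
import gym
import gym_minigrid  # noqa: F401  (registers the MiniGrid env ids)

num_procs = 16  # plays the role of the --procs hyperparameter
envs = [gym.make("MiniGrid-MultiRoom-N6-v0") for _ in range(num_procs)]

# An unseeded reset() draws a fresh procedurally generated layout.
obs = [env.reset() for env in envs]

for step in range(1000):  # rollout-collection loop
    for i, env in enumerate(envs):
        action = env.action_space.sample()  # stand-in for the PPO policy
        obs[i], reward, done, _ = env.step(action)
        if done:
            # Episode over in this env only: sample a new random layout
            # and continue rollouts there, independently of the others.
            obs[i] = env.reset()
```

The key point is that each unseeded reset() draws a fresh random layout, so nothing restricts which layouts the agent sees during training.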