Open · ghost opened this issue 3 years ago

Hi, I want to reuse your MiniGrid experiment as a benchmark for my paper on RL generalisation. It fits nicely, but I'm not clear on how to replicate the experiment that produces the orange line in your paper. Can you provide some insight? Are you running the training on 2,000,000 environments to generate the chart? Thanks a lot in advance.
Just to be more precise: I would like to train your agent on 1000 random environments and test it on 1000 other environments, to get the generalisation percentage on those test environments. I'm not sure how I can do that with the code provided. Thanks!
Hi, thanks for your interest! We only have an explicit train/test split for the Coinrun environment. For MiniGrid, we randomly sample from all possible layouts during training. This doesn't let us explicitly measure the generalisation gap, but the agents' performance (and learning speed) still correlates with how well they generalise, since the number of possible layouts is so large that they rarely see the same layout twice. So Figure 2 just shows the normal training performance we usually report in RL. Note that there's a lot of variation in the results, which is why ours are averaged over 30 random seeds.
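If you do want an explicit split on MiniGrid, here's a minimal sketch of one way to build it (this is not part of our released code; it assumes the old gym / gym-minigrid API, where seeding an environment fixes the layout drawn by the next reset(), and the env id and policy below are just placeholders):

```python
import random

import gym
import gym_minigrid  # noqa: F401  (registers the MiniGrid env ids)

# Draw 2000 distinct seeds and split them: each seed determines one layout.
rng = random.Random(0)
seeds = rng.sample(range(1_000_000), 2000)
train_seeds, test_seeds = seeds[:1000], seeds[1000:]

env = gym.make("MiniGrid-MultiRoom-N6-v0")  # placeholder env id

def run_episode(env, seed, policy):
    """Roll out one episode on the layout determined by `seed`."""
    env.seed(seed)  # fixes the layout generated by the next reset()
    obs = env.reset()
    done, episode_return = False, 0.0
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        episode_return += reward
    return episode_return

def policy(obs):
    # Stand-in for the trained PPO agent.
    return env.action_space.sample()

# Train only on layouts from train_seeds, then estimate generalisation
# as the average return over the held-out test layouts:
test_return = sum(run_episode(env, s, policy) for s in test_seeds) / len(test_seeds)
```

The generalisation gap would then just be the difference between the average returns on train_seeds and test_seeds.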
Sure, so if I understood it well: you iterate, training on 3 randomly chosen environments and then testing on another one, also randomly chosen, right? And the results are computed every 30 tests as the average reward over those 30 test environments?
For MiniGrid we're using the usual PPO setup (see here for hyperparameters; the number of parallel environments is set by --procs). Each environment is run entirely independently from the others, i.e. when we reach the end of the episode in one environment, we randomly sample a new layout in that environment and continue rollouts there. The layout sampling is not restricted in any way, so eventually we should have seen every possible layout during training. Generalisation only matters because there are so many layouts; I'm not sure exactly how many, but your estimate of 2M could be correct, though maybe it's a bit less. Perhaps you're confusing the N3r suffix in the environment name with 'train on 3 environments': N3r only means that layouts are randomly generated with up to 3 rooms. Not sure if that helps; please let me know if not. I feel like we might be talking past each other :).
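To make the rollout scheme concrete, here's a rough sketch (not our actual training code; the env id and the num_procs value are illustrative stand-ins for what --procs configures):

```python
import gym
import gym_minigrid  # noqa: F401  (registers the MiniGrid env ids)

num_procs = 16  # plays the role of the --procs hyperparameter
envs = [gym.make("MiniGrid-MultiRoom-N6-v0") for _ in range(num_procs)]

# An unseeded reset() draws a fresh procedurally generated layout.
obs = [env.reset() for env in envs]

for step in range(1000):  # rollout-collection loop
    for i, env in enumerate(envs):
        action = env.action_space.sample()  # stand-in for the PPO policy
        obs[i], reward, done, _ = env.step(action)
        if done:
            # Episode over in this env only: sample a new random layout
            # and continue rollouts there, independently of the others.
            obs[i] = env.reset()
```

The key point is that each unseeded reset() draws a fresh random layout, so nothing restricts which layouts the agent sees during training.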