openai / coinrun

Code for the paper "Quantifying Transfer in Reinforcement Learning"
https://blog.openai.com/quantifying-generalization-in-reinforcement-learning/
MIT License

Simulating deterministic resets for a single environment #39

agarwl closed this issue 4 years ago

agarwl commented 4 years ago

I have been trying to collect demonstrations from a trained PPO agent using a fixed coinrun environment (assume that level_seed is set to the same value whenever done is True). However, it seems that the reset state of the environment depends on the actions executed before it.

Specifically, suppose rep is set to 3. Then the second and third resets can only be reproduced by replaying the exact sequence of actions from a newly instantiated env. Because of this, a long sequence of actions spanning multiple resets and rewards (one each time we solve the env) can't be split into multiple demonstrations for that env and has to be used as a single stream of experience.
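For reference, here is a minimal sketch of the split I would like to be able to do. The helper name split_into_demonstrations and the flat list-of-actions representation are illustrative assumptions, not part of the coinrun code. The split is only valid if every reset returns the same state regardless of earlier actions.

```python
from typing import List, Sequence


def split_into_demonstrations(
    actions: Sequence[int], dones: Sequence[bool]
) -> List[List[int]]:
    """Split one action stream into per-episode action lists.

    Hypothetical helper: an episode ends at every step whose done flag is True.
    """
    if len(actions) != len(dones):
        raise ValueError("actions and dones must have the same length")

    demos: List[List[int]] = []
    current: List[int] = []
    for action, done in zip(actions, dones):
        current.append(action)
        if done:
            # The episode finished on this step; start a new demonstration.
            demos.append(current)
            current = []
    if current:
        # Keep a trailing, unfinished episode as its own chunk.
        demos.append(current)
    return demos
```

With action-dependent resets, only the first chunk can be replayed on its own from a fresh env; the later chunks start from states that a fresh env does not reproduce.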

Is there any change that I can make to coinrunenv.py to get around this problem?

kcobbe commented 4 years ago

If you would like to always get the same state on reset, you should set num-levels to 1. For example:

python -m coinrun.train_agent --run-id myrun --num-levels 1 --set-seed 13

The same flag will also work for the evaluation script. When you set num-levels to 1, the game state on reset will be independent of previous actions taken. You can change the particular level used by changing the value of set-seed.

agarwl commented 4 years ago

@kcobbe I should have been clearer: even when num-levels is set to 1, the reset states returned for the single environment differ and depend on the sequence of actions.

For example, these are the observations returned by the environment when the done flag is True (here rep is set to 3). These observations are visibly different.

[image: observations returned at each reset]
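The check behind this comparison can be sketched with a toy stand-in. ToyEnv below is hypothetical and is not the coinrun API; it only illustrates one way a reset state could end up depending on earlier actions (a random generator shared between stepping and resetting). Whether this is the cause in coinrun is an assumption, not something established here.

```python
import random


class ToyEnv:
    """Hypothetical single environment; NOT the coinrun API."""

    def __init__(self, level_seed: int, reseed_on_reset: bool):
        self.level_seed = level_seed
        self.reseed_on_reset = reseed_on_reset
        self.rng = random.Random(level_seed)
        self.t = 0

    def reset(self):
        if self.reseed_on_reset:
            # Recreate the generator so the reset ignores earlier actions.
            self.rng = random.Random(self.level_seed)
        self.t = 0
        # The "observation" is four values drawn from the generator.
        return tuple(self.rng.randrange(256) for _ in range(4))

    def step(self, action: int) -> bool:
        # Each action consumes an action-dependent amount of randomness.
        for _ in range(action + 1):
            self.rng.random()
        self.t += 1
        return self.t >= 3  # done after three steps


def reset_observations(env: ToyEnv, actions):
    """Collect the observation returned by every reset along one stream."""
    observations = [env.reset()]
    for action in actions:
        if env.step(action):
            observations.append(env.reset())
    return observations


actions = [0, 1, 2, 0, 1, 2]
shared = reset_observations(ToyEnv(13, reseed_on_reset=False), actions)
reseeded = reset_observations(ToyEnv(13, reseed_on_reset=True), actions)
print("distinct reset observations, shared rng:", len(set(shared)))
print("distinct reset observations, reseeded:  ", len(set(reseeded)))
```

Running the same comparison against the real environment (collecting the observation at each done and comparing it with the first reset) is how I got the differing frames above.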

Config used:

frame_stack: 1
game_type: standard
high_difficulty: False
is_high_res: False
num_envs: 32
num_eval: 1
num_levels: 1
paint_vel_info: -1
rep: 3
set_seed: -1
test: False
test_eval: False
train_eval: True
use_batch_norm: 0
use_black_white: 0
use_color_transform: 0
use_data_augmentation: 0
use_inversion: 0