lcswillems / torch-ac

Recurrent and multi-process PyTorch implementation of the deep reinforcement learning Actor-Critic algorithms A2C and PPO
MIT License

ParallelEnv class yields non-correct rewards in a minigrid environment #3

Open ycemsubakan opened 4 years ago

ycemsubakan commented 4 years ago

I tried to use the ParallelEnv class to create parallel episodes, with this MiniGrid environment: https://github.com/maximecb/gym-minigrid/blob/master/README.md (using MiniGrid-Empty-5x5-v0).

The reward for reaching the green goal should be `1 - c * time_taken_to_reach_green` (where c is a constant), but when I use ParallelEnv the rewards do not follow this; I am actually observing rewards that increase with time.

Example: say we have 10-step episodes. Normally we should observe rewards like [0, 0, 0.95, 0, 0, 0.9, 0, 0, 0.85, 0] (a list where the first element is the reward obtained at t=0, the second element is the reward at t=1, and so on). But with ParallelEnv() I observe rewards like [0, 0, 0.95, 0, 0, 0.95, 0, 0, 0.95, 0], or even increasing rewards such as [0, 0, 0.85, 0, 0, 0.90, 0, 0, 0.95, 0].

I might be misunderstanding the purpose of the ParallelEnv class: my understanding was that it is supposed to give completely independent episodes without disrupting the original reward structure. It would be great if you could let me know how I could fix this. Thank you!
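For reference, the reward shape described above can be sketched as a small function (assuming MiniGrid's goal reward formula `1 - 0.9 * (step_count / max_steps)`; the exact constant may differ between versions):

```python
def minigrid_goal_reward(step_count: int, max_steps: int) -> float:
    """Terminal reward for reaching the goal: decays linearly with the
    number of steps taken, so slower episodes earn strictly less."""
    return 1 - 0.9 * (step_count / max_steps)

# Episodes ending at steps 10, 20 and 30 (with max_steps=100): the non-zero
# rewards should decrease, never stay flat or increase within a rollout.
rewards = [round(minigrid_goal_reward(s, 100), 3) for s in (10, 20, 30)]
```

With this formula, `rewards` comes out as `[0.91, 0.82, 0.73]`, which is why flat or increasing non-zero rewards point at a bookkeeping problem rather than at the environment itself.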

lcswillems commented 4 years ago

ParallelEnv just runs the agent on several environments in parallel; I don't see how it could affect the rewards. Could you tell me how to reproduce the bug?
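For context, here is a minimal sketch of what a ParallelEnv-style wrapper does (this is not torch-ac's actual code; `ToyEnv`, `worker`, and `ParallelEnvSketch` are hypothetical names for illustration): each worker process owns one environment and relays `(obs, reward, done, info)` unchanged, so the rewards should be identical to stepping the environments serially.

```python
import multiprocessing as mp

class ToyEnv:
    """Deterministic toy environment: the episode ends after 3 steps with
    reward 1 - 0.25 * step_count (mimicking MiniGrid's decaying goal
    reward); every non-terminal step gives reward 0."""

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= 3
        reward = 1 - 0.25 * self.t if done else 0.0
        obs = self.t
        if done:
            self.reset()  # auto-reset on episode end
        return obs, reward, done, {}

def worker(conn, env):
    """Each worker owns one environment and relays transitions unchanged."""
    env.reset()
    while True:
        cmd, action = conn.recv()
        if cmd == "step":
            conn.send(env.step(action))
        else:  # "close"
            conn.close()
            break

class ParallelEnvSketch:
    """Steps several environments in parallel, one subprocess each."""

    def __init__(self, n_envs):
        ctx = mp.get_context("fork")  # fork start method (Unix only)
        self.conns, self.procs = [], []
        for _ in range(n_envs):
            parent, child = ctx.Pipe()
            p = ctx.Process(target=worker, args=(child, ToyEnv()), daemon=True)
            p.start()
            self.conns.append(parent)
            self.procs.append(p)

    def step(self, actions):
        for conn, action in zip(self.conns, actions):
            conn.send(("step", action))
        return [conn.recv() for conn in self.conns]

    def close(self):
        for conn in self.conns:
            conn.send(("close", None))
        for p in self.procs:
            p.join()

# Serial baseline: rewards are 0, 0, then 1 - 0.25 * 3 = 0.25 at episode end.
env = ToyEnv()
env.reset()
serial_rewards = [env.step(0)[1] for _ in range(3)]

# Parallel version: each worker should report exactly the same rewards.
penv = ParallelEnvSketch(2)
parallel_rewards = [[r for (_, r, _, _) in penv.step([0, 0])] for _ in range(3)]
penv.close()
```

In this sketch `serial_rewards` is `[0.0, 0.0, 0.25]` and each column of `parallel_rewards` matches it, since nothing in the pipe protocol touches the reward. If the real ParallelEnv shows a mismatch, the bug would have to be in how transitions are buffered or indexed around it, not in the environment.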

ycemsubakan commented 4 years ago

I have written my own code for this, which showcases the bug, and I will try to push it shortly. In the meantime, just try the environment MiniGrid-Empty-5x5-v0 and compare the rewards within an episode with and without ParallelEnv. (Even with a single environment inside ParallelEnv the rewards are not correct; I am guessing something is wrong with the time indexing?)

ayakayal commented 1 year ago

> I have written my own code for this, which showcases the bug, and I will try to push it shortly. In the meantime, just try the environment MiniGrid-Empty-5x5-v0 and compare the rewards within an episode with and without ParallelEnv. (Even with a single environment inside ParallelEnv the rewards are not correct; I am guessing something is wrong with the time indexing?)

Hello, did you manage to fix this bug? I am currently trying to test the same thing.