Closed denyHell closed 4 years ago
Overall, this topic is covered and discussed in papers such as the original A3C (the extra "A" stands for "Asynchronous") and IMPALA (a follow-up/improvement). These approaches come with their own difficulties, as discussed in the papers, which is why they are not included in stable-baselines.
The suggested idea could improve performance, but it is hard to say by how much (it depends heavily on the environment). It would require significant modifications to the core algorithms, and would be easier to do with PyTorch in stable-baselines3.
I noticed that for the vectorized environments, all environments take a step, and the agent waits for all of them to finish before taking the next step. The issue with my customized environment is that a step involving a reset() takes much longer than a usual step. In this case, all the other environments have to wait for the one in reset() to finish, which makes it very inefficient.
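To make the cost concrete, here is a minimal back-of-the-envelope sketch (with hypothetical timings) of why a synchronous vectorized step is bounded by the slowest worker: each joint step takes roughly the maximum of the per-env step times, so a single env stuck in a slow reset() stalls everyone.

```python
# Hypothetical per-env step durations (seconds); env 3 is mid-reset.
step_times = [0.01, 0.01, 0.01, 2.0]

# Synchronous vectorized step: wait for the slowest environment.
lockstep_cost = max(step_times)

# Idealized independent stepping: each env proceeds at its own pace,
# so the amortized cost per transition is the average step time.
independent_cost = sum(step_times) / len(step_times)

print(f"lockstep: {lockstep_cost:.3f}s per joint step")
print(f"independent (amortized): {independent_cost:.4f}s per transition")
```

With these made-up numbers the lockstep scheme pays 2.0 s for four transitions, while independent stepping would amortize to about 0.5 s each.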
I wonder if there is a way to handle an issue like this. Ideally, the agent would be able to interact with each environment independently through alternating calls of model.step() and env.step(), while keeping a counter of the number of steps taken in each individual environment. It would stop collecting experience once the total number of steps across all environments reaches a fixed threshold.
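A minimal single-process sketch of that collection scheme (the `ToyEnv` class, the `collect` helper, and the trivial policy are all hypothetical stand-ins, not stable-baselines API): each environment is stepped independently, a per-env counter is kept, and collection stops once the summed counters hit the threshold.

```python
class ToyEnv:
    """Hypothetical stand-in; real code would use a Gym env."""
    def __init__(self, episode_len):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return 0.0, 1.0, done, {}  # obs, reward, done, info

def collect(envs, policy, total_steps):
    """Step each env independently, tracking a per-env step counter;
    stop once the counters sum to total_steps."""
    counters = [0] * len(envs)
    obs = [env.reset() for env in envs]
    transitions = []
    while sum(counters) < total_steps:
        for i, env in enumerate(envs):
            if sum(counters) >= total_steps:
                break
            action = policy(obs[i])  # stands in for model.step()
            next_obs, reward, done, info = env.step(action)
            transitions.append((i, obs[i], action, reward, done))
            counters[i] += 1
            # Only this env pays for its own reset; the others keep going.
            obs[i] = env.reset() if done else next_obs
    return transitions, counters

envs = [ToyEnv(episode_len=3), ToyEnv(episode_len=5)]
transitions, counters = collect(envs, policy=lambda o: 0, total_steps=8)
print(counters, sum(counters))  # counters sum to exactly 8
```

Note this sketch is still sequential within one process; the actual speedup would require running each environment in its own worker process or thread (A3C/IMPALA-style) so that a slow reset() blocks only its own worker.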