PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License
leveraging parallel environments for sampling faster #242
It seems to me that your implementation is not leveraging parallel environments very much, but I'm not sure. Please correct me if I'm wrong.
My understanding is that one hyperparameter is the number of samples we collect before each agent update. Assuming we know the optimal value of this hyperparameter, using parallel environments should let us gather those samples faster. If, for example, we need 4096 samples before each update, it seems to me that your implementation instead gathers 4096 * num_processes samples per update, and I wasn't sure that this necessarily speeds up learning: https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail/blob/84a7582477fb0d5c82ad6d850fe476829dddd2e1/main.py#L113
One workaround could be, for example, changing `for step in range(args.num_steps):` to a while loop that checks the total number of samples gathered across all environments.
Nevertheless, I would appreciate hearing your view about this :)
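To make the point concrete, here is a small sketch (the argument names mirror the repo's flags, but the numbers are made up): with a vectorized env, each iteration of the rollout loop steps all workers at once, so one update already consumes `num_steps * num_processes` samples. Holding that product fixed while adding workers is one way to get faster sampling without changing the loop structure.

```python
# Sketch only: names mirror the repo's args, values are illustrative.

def samples_per_update(num_steps: int, num_processes: int) -> int:
    """Each of the num_steps loop iterations collects one transition
    from every parallel environment."""
    return num_steps * num_processes

def steps_for_fixed_batch(target_samples: int, num_processes: int) -> int:
    """To keep the per-update sample count fixed while adding workers,
    shrink num_steps rather than changing the loop structure."""
    assert target_samples % num_processes == 0
    return target_samples // num_processes

# e.g. a fixed budget of 4096 samples per update:
print(samples_per_update(128, 32))      # 4096 with 32 workers
print(steps_for_fixed_batch(4096, 32))  # num_steps = 128
print(steps_for_fixed_batch(4096, 1))   # num_steps = 4096 with a single worker
```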
-------edit:
I tried changing the for loop to a while loop like the following:
```python
step = 0
while step < args.num_steps:
    with torch.no_grad():
        value, action, action_log_prob, recurrent_hidden_states = actor_critic.act(
            rollouts.obs[step], rollouts.recurrent_hidden_states[step],
            rollouts.masks[step])
    obs, reward, done, infos = envs.step(action)
    for info in infos:
        if 'episode' in info.keys():
            episode_rewards.append(info['episode']['r'])
    masks = torch.FloatTensor([[0.0] if done_ else [1.0] for done_ in done])
    bad_masks = torch.FloatTensor(
        [[0.0] if 'bad_transition' in info.keys() else [1.0] for info in infos])
    rollouts.insert(obs, recurrent_hidden_states, action, action_log_prob,
                    value, reward, masks, bad_masks)
    step += 1 * args.num_processes
```
but the performance wasn't as good as when I used only one environment.
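One plausible reason for the drop, assuming (as in the repo's `RolloutStorage`) one storage slot per loop iteration: incrementing `step` by `num_processes` makes the loop run `num_steps / num_processes` iterations instead of `num_steps`, so each update sees far fewer transitions, and the visited time indices are no longer consecutive, which would break the return/GAE computation over the rollout. A minimal sketch of the iteration counts:

```python
# Illustrative numbers only; the loop mirrors the modified while loop above.
num_steps, num_processes = 128, 8

# original loop: one iteration per time index -> 128 * 8 = 1024 samples/update
original_indices = list(range(num_steps))

# modified loop: step += num_processes -> only 16 iterations,
# i.e. 16 * 8 = 128 samples/update, at non-consecutive time indices
modified_indices = []
step = 0
while step < num_steps:
    modified_indices.append(step)
    step += num_processes

print(len(original_indices) * num_processes)  # 1024
print(len(modified_indices) * num_processes)  # 128
```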