ikostrikov / pytorch-a2c-ppo-acktr-gail

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).
MIT License

Extending support to continuous control environments #173

Closed Akella17 closed 5 years ago

Akella17 commented 5 years ago

I want to know the significance of the squeeze operation (line 162) in a2c_ppo_acktr/envs.py. For environments with action_dim = 1, the squeeze operation passes scalar values as the action instead of single-element vectors.

Suggested correction: removing the squeeze operation allows training on gym's continuous control environments in addition to the environments that already work.

Current (a2c_ppo_acktr/envs.py, line 162):

actions = actions.squeeze(1).cpu().numpy()

Suggested replacement:

actions = actions.cpu().numpy()
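To make the shape difference concrete, here is a minimal sketch (not code from the repo; `num_processes` here just stands for the number of parallel environments in the vectorized wrapper):

```python
import torch

num_processes = 4                              # parallel envs in the vectorized wrapper
actions = torch.randn(num_processes, 1)        # continuous env with action_dim = 1

print(actions.squeeze(1).cpu().numpy().shape)  # (4,)   -> each env is handed a scalar
print(actions.cpu().numpy().shape)             # (4, 1) -> each env is handed a (1,) array
```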

ikostrikov commented 5 years ago

Right now it works for me on gym's continuous control environments (MuJoCo).

Can you send me an example where it fails?

Akella17 commented 5 years ago

The algorithm works for MuJoCo either way. However, it fails for gym's classic control environments (e.g. Pendulum-v0) or any other gym environment with action_dim = 1. The squeeze operation reduces the action array from shape [batch_size, 1] to [batch_size,], so each individual environment receives a scalar instead of a 1-element array when env.step() is called, which raises an error.

What I have observed is that removing the squeeze() call makes the algorithm compatible with classic control environments while having no effect on MuJoCo or the other environments that already work.
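For example, with the gym versions of that era (assuming Pendulum-v0's Box(1,) action space), a quick reproduction might look like the following; this is only an illustration, not code from the repo:

```python
import gym
import numpy as np

env = gym.make("Pendulum-v0")
print(env.action_space)        # Box(1,): step() expects a 1-element array

env.reset()
env.step(np.array([0.5]))      # OK: the action has shape (1,)
env.step(np.float32(0.5))      # fails: Pendulum's step() indexes the action, and a scalar cannot be indexed
```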

ikostrikov commented 5 years ago

Fixed in https://github.com/ikostrikov/pytorch-a2c-ppo-acktr/commit/88080da828dd4132bec0456b996e516fe356f75f
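For readers following along, the general shape of the fix is to squeeze only discrete (integer-typed) actions and leave continuous Box actions untouched. A hedged sketch of that idea, not necessarily the exact contents of the linked commit:

```python
import torch

class VecPyTorch:
    """Sketch of the vectorized-env wrapper from envs.py; only step_async is shown."""

    def __init__(self, venv, device):
        self.venv = venv
        self.device = device

    def step_async(self, actions):
        # Discrete actions arrive as integer tensors of shape [num_processes, 1];
        # the underlying envs expect plain integers, so drop the trailing dimension.
        if actions.dtype in (torch.int32, torch.int64):
            actions = actions.squeeze(1)
        # Continuous (Box) actions keep their [num_processes, action_dim] shape,
        # so action_dim = 1 envs such as Pendulum-v0 still receive a (1,) array.
        self.venv.step_async(actions.cpu().numpy())
```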