hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

Trying to understand hardware limitations for parallelizing PPO2 [question] #201

Open SerialIterator opened 5 years ago

SerialIterator commented 5 years ago

Describe the question

As far as I understand, when using a GPU, SubprocVecEnv runs multiple workers, each with its own environment, on the GPU, and then updates the model once it has gathered all the synchronous rollouts. When I set n_cpu = 8 I should expect 8 workers (envs) to be initialized and run on the GPU, and I assume these would be parallelized across 8 CUDA cores.

When I run a PPO2 model with n_cpu=8 I can see the GPU being utilized, then the CPU, as if the rollouts are being pushed to the CPU to update the model. My GPU has thousands of CUDA cores, but I only see a training speed-up up to n_cpu=32; beyond that, the CPU seems to run for a very long time between updates (nupdates). From what I can see, my CPU is unable to handle the volume of rollouts once n_cpu > 32?

Am I correct that the workers run on the GPU and model updates are done on the CPU? That would mean many CPU cores are necessary to handle the increase in workers, scaling linearly with the number of CUDA cores used?

Code example

import gym
import money_maker  # registers the custom 'maker-v0' environment

from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2

# multiprocess environment: one subprocess per worker
n_cpu = 32
env = SubprocVecEnv([lambda: gym.make('maker-v0') for _ in range(n_cpu)])

model = PPO2(MlpLstmPolicy, env, verbose=1, nminibatches=2, n_steps=256,
             learning_rate=2.55e-4, gamma=0.999,
             tensorboard_log="./ppo2_lstm_21_jan_morn_tensorboard/")

model.learn(total_timesteps=100000000)
model.save("ppo2LstmN32mini2steps256")
del model
env.close()

System Info

Describe the characteristics of your environment:

Additional context

I'm trying to figure out a rough calculation for the maximum hardware throughput of stable-baselines PPO2, so I can size a cloud instance without too much trial and error.
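One rough way to measure this empirically (a sketch, with CartPole-v1 standing in for the custom maker-v0 env): time pure environment stepping with random actions, which involves no GPU work at all, and see where throughput stops scaling as the worker count grows:

import time

import gym
import numpy as np

from stable_baselines.common.vec_env import SubprocVecEnv

if __name__ == '__main__':
    for n_envs in [4, 8, 16, 32, 64]:
        env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(n_envs)])
        env.reset()
        n_steps = 1000
        start = time.time()
        for _ in range(n_steps):
            # random actions: this loop is pure CPU/simulation work
            actions = np.array([env.action_space.sample() for _ in range(n_envs)])
            env.step(actions)
        elapsed = time.time() - start
        print("n_envs=%3d: %8.0f env steps/s" % (n_envs, n_envs * n_steps / elapsed))
        env.close()

Once total steps/s stops growing, adding more workers only buys inter-process overhead.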

SerialIterator commented 5 years ago

I think I have this backwards, huh? Workers are on the CPU and policy updates are done on the GPU?

araffin commented 5 years ago

Workers are on the CPU and policy updates are done on the GPU?

Yes. The simulation runs on the CPU and the gradient updates are done on the GPU (every n_steps for PPO2).

You should also be aware that a CUDA core is quite different from a CPU core.
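One way to see this split for yourself (a minimal sketch, with CartPole-v1 as a stand-in env): hide the GPU via CUDA_VISIBLE_DEVICES and compare the wall time against a GPU-enabled run. The time that does not shrink when the GPU is available is roughly the CPU-bound simulation part.

import os
import time

# hide the GPU to measure the CPU-only baseline; this must be set
# BEFORE tensorflow / stable_baselines are imported
os.environ['CUDA_VISIBLE_DEVICES'] = ''   # '' = CPU only, '0' = first GPU

import gym
from stable_baselines.common.policies import MlpLstmPolicy
from stable_baselines.common.vec_env import SubprocVecEnv
from stable_baselines import PPO2

if __name__ == '__main__':
    env = SubprocVecEnv([lambda: gym.make('CartPole-v1') for _ in range(8)])
    model = PPO2(MlpLstmPolicy, env, verbose=0, nminibatches=2, n_steps=256)
    start = time.time()
    model.learn(total_timesteps=50000)
    print("wall time: %.1f s" % (time.time() - start))
    env.close()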

SerialIterator commented 5 years ago

Thanks. So workers are on the CPU and gradient updates are on the GPU. From fiddling with different environments so far, it seems the number of CPU cores and the amount of RAM are the limiting factors for how many workers you can run. The more workers, the more "exploration" you can do in a given time period. But a larger environment (stacked-frame CnnLstm, for example) uses far more RAM, which then becomes the limiting factor. The GPU should be used more efficiently with more workers, since data transfer is slow but throughput is high, right? And the limiting factor on the GPU side would be its memory, which has to hold all the rollouts and/or the policy parameters?
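As a back-of-the-envelope check on that last point (all shapes are assumed numbers, float32 throughout), the rollout buffer alone grows linearly in both the worker count and n_steps:

import numpy as np

# assumed numbers for illustration: a stacked-frame CnnLstm-style setup
n_envs = 32               # workers
n_steps = 256             # steps collected per worker per rollout
obs_shape = (84, 84, 4)   # e.g. four stacked 84x84 frames

obs_bytes = n_envs * n_steps * np.prod(obs_shape) * 4  # 4 bytes per float32
print("rollout observations alone: %.2f GB" % (obs_bytes / 1024.0 ** 3))
# ~0.86 GB per rollout, before counting network parameters, activations,
# and optimizer state on the GPU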