openai / baselines

OpenAI Baselines: high-quality implementations of reinforcement learning algorithms

PPO2 GPU usage? #649

Open andytwigg opened 5 years ago

andytwigg commented 5 years ago

On a 16-core machine with a single 1080 Ti GPU, running the following:

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=1e6 --network=cnn --num_env=16

I observe around 1200 fps, with GPU utilization (via nvidia-smi) oscillating between ~5% and ~50% roughly once per second; in any case, average GPU utilization looks to be no more than 20%. Setting num_env=64 increases fps to ~1600 and the GPU usage peaks reach ~80%, but less frequently, so I suspect the average is still low.

So increasing num_env or nsteps does not seem to help: increasing the former much beyond ncpus means more communication overhead and stragglers, while increasing the latter means the GPU is idle for longer periods while the subprocesses execute step.

I'm curious - are you guys (OpenAI) seeing high GPU usage with ppo2? I suspect the culprit is that calls to model.train and model.step are both synchronized with the subprocess step, so even with the larger batches in step that a bigger num_env gives, the master still spends a lot of time waiting. Does using MPI + vecenv alleviate this, or do you have another suggestion for making better use of my GPUs?
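
Roughly, the pattern I mean looks like this (a toy sketch with sleeps standing in for the real calls - illustrative only, not the actual ppo2 code):

import time

def venv_step(actions):
    # stand-in for SubprocVecEnv.step: CPU work in subprocesses, master waits for the slowest one
    time.sleep(0.01)
    return actions

def model_step(obs):
    # stand-in for model.step: one small forward pass on the GPU
    time.sleep(0.001)
    return obs

def model_train(batch):
    # stand-in for model.train: the only sustained stretch of GPU work
    time.sleep(0.05)

obs, nsteps = 0, 128
for update in range(3):
    batch = []
    for _ in range(nsteps):
        actions = model_step(obs)   # short burst of GPU work...
        obs = venv_step(actions)    # ...then the master just waits on the env subprocesses
        batch.append(obs)
    model_train(batch)              # GPU busy here, env subprocesses idle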

andytwigg commented 5 years ago

Update: running

mpirun -np 4 python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=1e6 --network=cnn --num_env=8

gives ~800 fps and similar GPU usage of ~10-20% (although it uses 4x the GPU memory). So I'm not sure what to do with PPO2 here - @pzhokhov, is this consistent with what you guys see?

I guess it's easier to get high GPU usage with replay-buffer-style algorithms such as DQN, since you can separate the problem of populating the buffer from training. Would it make sense to construct a variant of PPO2 where subprocesses fill a buffer and a separate thread trains on it?
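
Roughly the shape I have in mind, as a toy sketch (collect_rollout and train_on are placeholders, not baselines functions):

import queue
import threading

rollouts = queue.Queue(maxsize=4)   # the buffer between actors and learner

def collect_rollout():
    # placeholder for stepping the vec env for nsteps with a (possibly stale) copy of the policy
    return list(range(128))

def train_on(batch):
    # placeholder for model.train on one rollout
    return sum(batch) / len(batch)

def actor_loop(n_rollouts):
    for _ in range(n_rollouts):
        rollouts.put(collect_rollout())   # blocks when the buffer is full

actor = threading.Thread(target=actor_loop, args=(10,))
actor.start()
for _ in range(10):
    batch = rollouts.get()   # the learner consumes rollouts while the actor keeps stepping
    train_on(batch)
actor.join()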

pzhokhov commented 5 years ago

That is an interesting point... I'll check, but I think that is correct and we see similar GPU utilization statistics. One rather straightforward thing to try is to increase the number of optimizer epochs (and maybe nminibatches). You are correct that the current implementation does not parallelize steps and updates, so the GPU will idle for at least as long as it takes a single environment to take nsteps steps (and likewise, the env subprocesses will idle for at least as long as it takes to update the network parameters).
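
For example, something like this (illustrative values; if I remember right, baselines.run forwards extra command-line arguments to ppo2.learn, whose noptepochs and nminibatches arguments default to 4):

python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=1e6 --network=cnn --num_env=16 --noptepochs=8 --nminibatches=8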

The variant of PPO with a buffer - technically, PPO training (like most policy gradient algorithms) requires the data to be on-policy, so if the actor processes use stale network copies (while the learner computes the gradients and updates) it may affect performance negatively. That being said, it may still be worth trying - as @joschu points out, the data is only on-policy for the first minibatch of the update anyway. Either way, before implementing parallel updates and steps, I think some profiling would be a good idea: imagine that env.step in the environments is the bottleneck. The buffer will let you do steps and gradient updates at the same time, but if stepping falls behind, the GPU will crunch through all the data in the buffer and idle nonetheless.
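
As a quick way to check, you could time the vec env alone with random actions - a rough sketch along these lines (SubprocVecEnv is the real class, but the atari wrappers used during training are omitted, so this only gives an upper bound on env throughput):

import time
import numpy as np
import gym
from baselines.common.vec_env.subproc_vec_env import SubprocVecEnv

def main():
    num_env, nsteps = 16, 1024
    # no frame-skip / atari wrappers here, so this is only a rough ceiling on env fps
    venv = SubprocVecEnv([lambda: gym.make('PongNoFrameskip-v4') for _ in range(num_env)])
    venv.reset()
    start = time.time()
    for _ in range(nsteps):
        actions = np.array([venv.action_space.sample() for _ in range(num_env)])
        venv.step(actions)
    print('env-only fps: %.0f' % (num_env * nsteps / (time.time() - start)))
    venv.close()

if __name__ == '__main__':   # multiprocessing needs the main-module guard on some platforms
    main()

If the number this prints is close to the fps you see during training, env.step is the bottleneck and a buffer alone won't raise GPU utilization.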

andytwigg commented 5 years ago

Have you been able to confirm GPU utilization?

> imagine that env.step in the environments is the bottleneck. The buffer will let you do steps and gradient updates at the same time, but if stepping falls behind, the GPU will crunch through all the data in the buffer and idle nonetheless.

I expect that is the case, but you can always increase num_envs so I don't see why this would be a problem.

On a related note, I wondered if I was misreading the MPI fps readings - it looked like MPI was slower than using SubprocVecEnv alone. What's the recommended usage here? It seems like the code will end up being very complex if it tries to support arbitrary combinations of MPI + SubprocVecEnv. Do you have concrete cases where MPI is definitely better than SubprocVecEnv, and vice versa?

Thanks Andy


pzhokhov commented 5 years ago

On my GTX 1070, driver 396.44, CUDA 9.0 and tensorflow-gpu 1.11.0 I get the following:
a) non-MPI, --num_env=16: utilization bounces between 7 and 52 percent; reported fps is about 1300
b) MPI with 4 workers, --num_env=8: utilization is between 20 and 80 percent; reported fps is about 430 (note that in the MPI case this translates into an actual fps of 430 * number of workers = 1720)
c) non-MPI, --num_env=32: utilization is between 7 and 70 percent; reported fps ~1550

So basically yes, I see the same utilization pattern. Increasing num_env to keep the GPU busy works, just as you've seen in your experiments, up until the CPUs cannot keep up - which also means that a simple way to keep the GPU busier without any code changes is to get a machine with more CPUs and increase num_env.

MPI vs SubprocVecEnv: if you are planning to expand to a cluster of machines for training, MPI is necessary. For single-machine training with small neural nets, MPI vs SubprocVecEnv does not matter much (as you have also seen, the performance is about the same). For large enough neural nets SubprocVecEnv is definitely better, because it keeps a single copy of the network weights on the GPU, whereas MPI will have number_of_workers copies. I know that's a little generic, but if your use case is somewhere in between I'd recommend trying out both options and seeing which works best (i.e. do not trust me, trust experiments :))

andytwigg commented 5 years ago

Thanks for confirming. In my case, I'm interested in using multiple GPUs: I have 2x GTX 1080 Ti and 16 CPUs. Is it possible to get the MPI mode to round-robin access to the GPUs, i.e. the equivalent of using CUDA_VISIBLE_DEVICES=0 for half the workers and =1 for the other half?


pzhokhov commented 5 years ago

I think the simplest way to do this would be with bash. If you are using Open MPI, something like this should work:

mpirun -np 4 bash -c 'CUDA_VISIBLE_DEVICES=$(($OMPI_COMM_WORLD_RANK % 2)) python -m baselines.run --alg=ppo2 ... '

Similarly, MPICH also has an environment variable that corresponds to the process rank.
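
For example, if I recall correctly MPICH's Hydra launcher exports the rank as PMI_RANK, so the equivalent would be something like:

mpirun -np 4 bash -c 'CUDA_VISIBLE_DEVICES=$(($PMI_RANK % 2)) python -m baselines.run --alg=ppo2 ... '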