andytwigg opened 5 years ago
Update: running

mpirun -np 4 python -m baselines.run --alg=ppo2 --env=PongNoFrameskip-v4 --num_timesteps=1e6 --network=cnn --num_env=8
gives ~800 fps and similar GPU usage (~10-20%), although it uses 4x the GPU memory. So I'm not sure what to do with PPO2 here - @pzhokhov, is this consistent with what you guys see?
I guess replay-buffer-style algorithms such as DQN make it easier to achieve high GPU usage, since you can separate the problem of populating the buffer from training on it. Would it make sense to construct a variant of PPO2 where we fill a buffer from subprocesses and have a separate thread training on it?
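The buffer idea above can be sketched as a producer/consumer pair. This is a minimal toy sketch (the `collect_rollout` and `train_on` callables are hypothetical stand-ins, not baselines APIs): one thread fills a bounded queue with rollouts while the main thread consumes them for training. As discussed below, the learner may then be training on slightly stale, off-policy data.

```python
import queue
import threading

def decoupled_actor_learner(collect_rollout, train_on, n_rollouts=4, buffer_size=2):
    # Bounded queue: the actor blocks once the buffer is full, which
    # limits how stale the data the learner sees can get.
    buf = queue.Queue(maxsize=buffer_size)
    done = object()  # sentinel marking the end of the rollout stream

    def actor():
        for i in range(n_rollouts):
            buf.put(collect_rollout(i))  # env stepping happens here
        buf.put(done)

    t = threading.Thread(target=actor)
    t.start()
    results = []
    while True:
        item = buf.get()
        if item is done:
            break
        results.append(train_on(item))  # gradient updates happen here
    t.join()
    return results
```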
That is an interesting point... I'll check, but I think that is correct and we have similar GPU utilization statistics. One rather straightforward thing to try is to increase the number of optimizer epochs (and maybe nminibatches). You are correct, the current implementation does not parallelize steps and updates, so the GPU will idle for at least as long as it takes a single environment to take nsteps steps (and likewise, env subprocesses will idle for at least as long as it takes to update the network parameters).
The variant of PPO with a buffer - technically, PPO training (like most policy gradient algorithms) requires the data to be on-policy, so if the actor processes use stale network copies (while the learner computes the gradients and updates) it may affect performance negatively. That being said, it may still be worth trying - as @joschu points out, the data is only on-policy for the first minibatch of the update anyway. Either way, before implementing parallel updates and steps, I think some profiling would be a good idea - imagine that env.step in the environments is the bottleneck. The buffer will let you do steps and gradient updates at the same time, but if stepping falls behind, the GPU will crunch all the data in the buffer and idle nonetheless.
Have you been able to confirm GPU utilization?
> imagine that env.step in the environments is the bottleneck. The buffer will let you do steps and gradient updates at the same time, but if steps is falling behind, GPU will crunch all the data in the buffer and idle nonetheless.
I expect that is the case, but you can always increase num_envs, so I don't see why this would be a problem.
On a related note, I wondered if I was misreading the MPI fps readings - it looked like it was slower than using subprocvecenv. What's the recommended usage here? It looks like the code will end up being very complex if it tries to support arbitrary combinations of MPI + subprocvecenv. Do you have concrete instances where MPI is definitely better than subprocvecenv, and vice versa?
Thanks Andy
On my GTX1070, driver 396.44, CUDA 9.0 and tensorflow-gpu 1.11.0 I get the following:

a) non-mpi, --num_env=16 - utilization bounces between 7 and 52 percent; reported fps is about 1300
b) mpi with 4 workers, --num_env=8 - utilization is between 20 and 80 percent; reported fps is about 430 (note that in the MPI case this translates into an actual fps of 430 * number of workers = 1720)
c) non-mpi, --num_env=32 - utilization is between 7 and 70 percent; reported fps ~1550
So basically yes, I can see the same utilization pattern. Increasing num_envs to keep the GPU busy works - just as you've seen in your experiments - up until the CPUs cannot keep up. This also means that a simple way to keep the GPU busier without any code changes is to get a machine with more CPUs and increase num_envs.

MPI vs subprocvecenv: if you are planning to expand to a cluster of machines for training, MPI is necessary. For single-machine training with small neural nets, MPI vs subprocvecenv does not matter much (as you have also seen, the performance is about the same). For large enough neural nets, subprocvecenv is definitely better, because subprocvecenv keeps a single copy of the network weights on the GPU, whereas MPI will have number_of_workers copies. I know that's a little generic; if your use case is somewhere in between, I'd recommend trying out both options and seeing which works best (i.e. do not trust me, trust experiments :))
Thanks for confirming. In my case, I'm interested in using multiple GPUs. I have 2x GTX1080Ti and 16 CPUs. Is it possible to get the MPI mode to round-robin access to the GPUs? I.e., the equivalent of using CUDA_VISIBLE_DEVICES=0 for half the workers and =1 for the other half?
I think the simplest way to do this would be with bash. If you are using openMPI, something like this should work:
mpirun -np 4 bash -c 'CUDA_VISIBLE_DEVICES=$(($OMPI_COMM_WORLD_RANK % 2)) python -m baselines.run --alg=ppo2 ... '
Similarly, mpich also has an environment variable that corresponds to the process rank.
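The same round-robin assignment can also be done inside the Python process itself, as long as it runs before TensorFlow is imported (TensorFlow picks up CUDA_VISIBLE_DEVICES at import/session time). A minimal sketch - the environment-variable names are launcher-specific (OMPI_COMM_WORLD_RANK for openMPI, PMI_RANK for mpich's Hydra; check what your install actually sets):

```python
import os

def assign_gpu_round_robin(num_gpus):
    # Read the MPI rank from common launcher environment variables;
    # fall back to rank 0 when not launched under MPI at all.
    rank = int(os.environ.get("OMPI_COMM_WORLD_RANK",
                              os.environ.get("PMI_RANK", 0)))
    # Must run before importing tensorflow for the pinning to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % num_gpus)
    return rank % num_gpus
```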
On a 16-core machine with a 1x 1080Ti GPU, running the following I observe around 1200 fps and GPU utilization (through nvidia-smi) peaking from ~5% to ~50% roughly every second or so. In any case, average GPU utilization looks to be no more than 20%. Setting num_env=64 increases fps to ~1600 and GPU usage peaks at ~80%, but less frequently - I suspect the average is still low. So increasing num_env or nsteps does not seem to help, since increasing the former much beyond ncpus means more communication overhead and stragglers, while increasing the latter means that the GPU is idle for longer periods while subprocesses execute step.

I'm curious - are you guys (openai) seeing high GPU usage with ppo2? I suppose the culprit is that calls to model.train and model.step are both synchronized with the subprocess step, so even with larger batch sizes in step from num_env, the master still spends a lot of time waiting. Does using MPI + vecenv alleviate this, or is there some other suggestion you have for making better use of my GPUs?
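To make that synchronization concrete, here is a toy sketch of the serialized rollout/update loop (hypothetical stand-in callables, not the actual baselines code): stepping and training strictly alternate, so the GPU idles during the rollout and the env subprocesses idle during the update.

```python
def ppo2_style_loop(env_step, model_step, model_train,
                    n_updates=2, nsteps=4, noptepochs=2, nminibatches=2):
    train_calls = 0
    for _ in range(n_updates):
        # Rollout phase: GPU mostly idle while subprocesses step.
        batch = [env_step(model_step(None)) for _ in range(nsteps)]
        mb = len(batch) // nminibatches
        # Update phase: env subprocesses idle while the GPU trains.
        for _ in range(noptepochs):
            for i in range(0, len(batch), mb):
                model_train(batch[i:i + mb])
                train_calls += 1
    return train_calls
```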