ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] What kind of parallelization gains should be expected when using Ray? #8880

Closed Andrej4156 closed 3 years ago

Andrej4156 commented 4 years ago

I'm working on my first parallelized deep learning project and I'm trying to get a sense of whether my results are normal or if there might be something wrong with my implementation.

I am running a custom OpenAI Gym environment in a Docker container. I have tried several different learning algorithms, but I've gotten the best learning results using PPO and A3C. However, there is a huge difference in scaled performance between the two. Here are the speed results I'm getting (see the sketch after the list for how the runs are launched):

- PPO: 1 worker = ~720 timesteps/second (ts/s), 10 workers = ~1730 ts/s (roughly a 2.4x increase for 10x the resources)
- A3C: 1 worker = ~1000 ts/s, 10 workers = ~5400 ts/s (a 5.4x increase for 10x the resources)
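
For reference, this is roughly how such runs can be launched through RLlib's Tune API (the environment name and import are placeholders for my custom Docker-hosted Gym env, and the stop criterion is arbitrary):

```python
import ray
from ray import tune
from ray.tune.registry import register_env

# Placeholder import for the custom Gym environment described above.
from my_project.envs import MyCustomEnv

# Register the env under a name RLlib can look up.
register_env("my_custom_env", lambda env_config: MyCustomEnv(env_config))

ray.init()

# One run per worker count: num_workers controls the number of parallel
# rollout workers, and num_gpus=0 keeps training on the CPU (the GPU hurt
# throughput in my tests).
tune.run(
    "PPO",  # or "A3C" for the second set of measurements
    config={
        "env": "my_custom_env",
        "num_workers": 10,  # compared against num_workers=1
        "num_gpus": 0,
    },
    stop={"timesteps_total": 1_000_000},
)
```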

PPO seems slow, but it learns a lot faster than A3C, so even at lower run speeds it outperforms A3C. Strangely, in htop I can see CPU thread utilization at roughly 90% while episodes are being collected, after which it drops to around 30-40% for the rest of the iteration. It actually spends far more time at 30-40% than at 90+%, which seems odd.

I'm running this in a Docker container and have replicated the behavior on another machine, so it would seem to be hardware independent. If it's relevant, though, I'm running on an Intel 8700K (6 cores, 12 threads) with 64 GB of RAM and a GTX 1070 Ti (though I disabled the GPU because it actually hurts performance by 10-15%).

I'm new to Ray and DL in general, so I apologize if this is a normal phenomenon. I haven't been able to find anything about it elsewhere online, so I figured I'd try here.

Ray version and other system information (Python version, TensorFlow version, OS): latest Ray version, Python 3.7, TensorFlow 2.1, Ubuntu 18.04

EDIT: I've done some additional testing and found the following:

- PPO: 2 workers = ~1050 ts/s
- PPO: 5 workers = ~1480 ts/s

Is it normal for returns to diminish this rapidly? If so, how do supercomputers with hundreds of CPUs or GPUs solve large problems?

rkooo567 commented 4 years ago

cc @sven1977

ericl commented 4 years ago

This sounds normal; RL scaling is complicated. PPO operates in two phases (parallel sample collection, then synchronous SGD on the collected batch), so you're probably seeing the SGD bottleneck come into play: adding rollout workers speeds up sampling, but the optimization step still runs on a single learner process, which also explains the utilization drop you see in htop.

https://docs.ray.io/en/master/rllib-training.html#scaling-guide
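
For later readers, these are the PPO config knobs that mostly determine the cost of the SGD phase (illustrative values only, not tuned recommendations; the env name is a placeholder):

```python
from ray import tune

# Illustrative values only; the right settings depend on the env, model, and hardware.
# Rollout collection scales with num_workers, but the SGD phase runs on the single
# learner process, so shrinking the optimization work per iteration (or giving the
# learner a GPU) is what usually helps once sampling is already parallelized.
tune.run(
    "PPO",
    config={
        "env": "my_custom_env",     # placeholder custom env name
        "num_workers": 10,          # parallel rollout workers (sampling phase)
        "num_gpus": 0,              # set to 1 to run the SGD phase on a GPU
        "train_batch_size": 4000,   # timesteps collected per training iteration
        "sgd_minibatch_size": 128,  # minibatch size used within the SGD phase
        "num_sgd_iter": 30,         # number of passes (epochs) over the train batch
    },
)
```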

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 3 years ago

Hi again! The issue will be closed because there has been no further activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!