notadamking / RLTrader

A cryptocurrency trading environment using deep reinforcement learning and OpenAI's gym
https://discord.gg/ZZ7BGWh
GNU General Public License v3.0
1.73k stars · 540 forks

Improve Utilization of GPU #10

Open notadamking opened 5 years ago

notadamking commented 5 years ago

This library achieves very high success rates, though it takes a very long time to optimize and train. This could be improved if we could figure out a way to utilize the GPU more during optimization/training, so the CPU can be less of a bottleneck. Currently, the CPU is being used for most of the intermediate environment calculations, while the GPU is used within the PPO2 algorithm during policy optimization.

I am currently optimizing/training on the following hardware:

The bottleneck on my system is definitely the CPU, which is surprising since this library takes advantage of the Threadripper's multi-threading, yet my GPU stays at around 1-10% utilization. I have some ideas on how this could be improved, but would like to start a conversation.

  1. Increase the size of the policy network (i.e. increase the number of hidden layers or the number of nodes in each layer) - see the sketch below this list.

  2. Do less work in each training loop, so the GPU loop is called more often.
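A rough sketch of idea 1, assuming stable-baselines 2.x as the training backend; the env id 'TradingEnv-v0', the layer sizes, and the timestep count are placeholders rather than RLTrader's actual configuration:

```python
# Hedged sketch: widen/deepen the policy network via policy_kwargs so each
# PPO2 update pushes more work onto the GPU relative to CPU-side env stepping.
# 'TradingEnv-v0', the layer sizes, and the timestep count are placeholders.
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import DummyVecEnv

env = DummyVecEnv([lambda: gym.make('TradingEnv-v0')])  # placeholder env id

model = PPO2(
    'MlpPolicy', env,
    policy_kwargs=dict(net_arch=[512, 512, 256]),  # larger shared MLP
    verbose=1,
)
model.learn(total_timesteps=100_000)
```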

I would love to hear what you guys think. Feel free to share any ideas or knowledge here.

laneshetron commented 5 years ago

Well, I believe you could swap out some of the numpy logic in your environment with tensorflow methods, which should be eagerly run on the GPU. Also, I haven't done any profiling or anything, but I'd guess that fitting the SARIMAX model on each observation step is very slow. Perhaps it could be precomputed?

Just some ideas!

notadamking commented 5 years ago
  1. Great idea to replace some of the numpy logic with tensorflow. Though, I am curious how much of an improvement this will yield, as there aren't many large calculations done in numpy. Perhaps more impactful would be to replace the sklearn scaling methods with tensorflow methods, since we re-scale the entire data frame on each step.

  2. Pre-computing the SARIMAX predictions is another great idea. Since they are currently calculated on each time step, any time we reset the environment we re-calculate all of the same SARIMAX predictions again.
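For the SARIMAX point, a rough sketch of what pre-computation could look like, assuming statsmodels is the model backend; the 'Close' column name, the (p, d, q) order, and the warm-up length are assumptions, not RLTrader's actual settings:

```python
# Hedged sketch: fit/forecast SARIMAX once over the whole data frame up front,
# so env.reset()/env.step() only look up a precomputed value instead of
# refitting the model. Column name, order, and min_history are assumptions.
import numpy as np
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

def precompute_sarimax_forecasts(df: pd.DataFrame, order=(1, 1, 1), min_history=30):
    """Return one-step-ahead forecasts aligned with df.index, computed once."""
    forecasts = np.full(len(df), np.nan)
    for t in range(min_history, len(df)):
        model = SARIMAX(df['Close'].iloc[:t], order=order,
                        enforce_stationarity=False, enforce_invertibility=False)
        fit = model.fit(disp=False)
        forecasts[t] = np.asarray(fit.forecast(steps=1))[0]
    return pd.Series(forecasts, index=df.index, name='sarimax_forecast')

# The environment can then read forecasts.iloc[current_step] in step()
# instead of re-fitting SARIMAX on every step and every reset.
```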

archenroot commented 5 years ago

I have a 32-thread dual Xeon with dual 1080 Ti cards, and only one CPU thread is at 100%, two others are around 50%, two others under 20%, and the rest under 10%, so the CPU load is super low. As for the GPUs, one sits constantly at 0% and the other at 1% max. :-)

archenroot commented 5 years ago

I didn't do full profiling, but you can use py-spy (similar to iotop/top) to watch the behavior: [py-spy screenshot]

botemple commented 5 years ago

Same here. How can we fully utilize multiple GPUs to train the agents? Making it distributed-ready would be even better.

TalhaAsmal commented 5 years ago

@archenroot have you tried increasing the parallelism? Increase n_jobs in optimize.py to a value equal to the number of cores you have, and it should increase utilization.
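For reference, a minimal sketch of the kind of change being suggested, assuming a recent Optuna release; the toy objective below is a placeholder for RLTrader's real objective in optimize.py:

```python
# Hedged sketch: run Optuna trials in parallel by raising n_jobs.
# The objective here is a toy placeholder, not RLTrader's actual one.
import multiprocessing

import optuna

def objective(trial):
    x = trial.suggest_float('x', -10, 10)  # placeholder hyperparameter
    return (x - 2) ** 2

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, n_jobs=multiprocessing.cpu_count())
```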

archenroot commented 5 years ago

@TalhaAsmal - well, it doesn't work beyond about 10; I even tried 64 :-) As reported in another issue, you then run into concurrent-access problems with SQLite. I replaced SQLite with a Postgres engine, but Optuna doesn't currently support custom parameters (pool_size, etc.) - there is a PR for it already waiting to be merged - so SQLAlchemy fails on the default config with Postgres as well. Once Optuna merges that PR, we can achieve this heavy parallelism, but it doesn't work as of now...

After 2 days, around 400 trials have finished with 4 threads :D (a brutal race), and a lot of them are PRUNED, i.e. marked at an early stage as unpromising... We'll see what config I end up with... I think another 2-3 days...
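For anyone landing here later: recent Optuna releases expose custom SQLAlchemy engine parameters through RDBStorage's engine_kwargs, which is roughly what is needed for Postgres; the connection URL, study name, and pool settings below are placeholders:

```python
# Hedged sketch, assuming a recent Optuna release where RDBStorage accepts
# engine_kwargs; the connection URL, study name, and pool sizes are placeholders.
import optuna
from optuna.storages import RDBStorage

storage = RDBStorage(
    url='postgresql://user:password@localhost:5432/optuna_db',
    engine_kwargs={'pool_size': 20, 'max_overflow': 0},
)
study = optuna.create_study(study_name='rltrader_ppo2', storage=storage,
                            load_if_exists=True)  # reuses existing trials in Postgres
```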

archenroot commented 5 years ago

@TalhaAsmal - I will try this evening to install the custom Optuna branch with the requested fix for custom driver params with Postgres.

TalhaAsmal commented 5 years ago


@archenroot did you manage to try it with the custom Optuna branch? I also ran into concurrency issues with SQLite, but since I have a very old CPU (a 2600K), I just reduced the parallelism to 2, with obvious negative consequences for speed.

dennywangtenk commented 5 years ago
  1. Using SubprocVecEnv instead of DummyVecEnv should improve things a lot in a multi-CPU environment (see the sketch after this list).

Here is why. According to the baselines docs on Vectorized Environments:

DummyVecEnv: creates a simple vectorized wrapper for multiple environments, calling each environment in sequence on the current Python process.
SubprocVecEnv: creates a multiprocess vectorized wrapper for multiple environments, distributing each environment to its own process, allowing significant speed up when the environment is computationally complex.

In my own experiments with Atari games, using SubprocVecEnv improved performance by 200-250%.

  2. And there is more: according to @zuoxingdon's method here, a "chunk"-based VecEnv can boost performance by an additional 900% compared to SubprocVecEnv, and an implementation can be found here.

  3. To further improve GPU utilization, it was suggested here to increase the number of environments to keep the GPU busy.
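A minimal sketch of point 1, assuming stable-baselines 2.x; the env id, the number of environments, and the policy are placeholders rather than RLTrader's actual setup:

```python
# Hedged sketch: step several env copies in parallel processes (SubprocVecEnv)
# instead of sequentially in one process (DummyVecEnv), and raise n_envs to
# keep the GPU busier. Env id, n_envs, and policy are placeholders.
import gym
from stable_baselines import PPO2
from stable_baselines.common.vec_env import SubprocVecEnv

def make_env(rank):
    def _init():
        env = gym.make('TradingEnv-v0')  # placeholder env id
        env.seed(rank)
        return env
    return _init

if __name__ == '__main__':  # required on platforms that spawn subprocesses
    n_envs = 8
    env = SubprocVecEnv([make_env(i) for i in range(n_envs)])
    model = PPO2('MlpPolicy', env, verbose=1)
    model.learn(total_timesteps=100_000)
```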

archenroot commented 5 years ago

@dennywangtenk - sure, I used SubprocVecEnv, but did you test it yourself before writing here? Some other things break, and SQLite is not a storage made for concurrent access; other databases don't work yet either, because Optuna doesn't support custom config (there is a PR, but it isn't released - you can build it yourself...).

dennywangtenk commented 5 years ago

@archenroot, I got a similar error. It seems that Optuna and baselines' SubprocVecEnv both use multiprocessing, which causes conflicts. Setting n_jobs = 1 forces Optuna to run sequentially. We may need to check the SubprocVecEnv source code to see whether it is thread safe.

Ruben-E commented 5 years ago

@dennywangtenk does it actually make sense to set n_jobs to 1 and switch to SubprocVecEnv? It looks like all sub-environment processes then use the same parameters within each trial. The goal of the optimize step is to find the optimal parameters by testing as many different parameter sets as possible. Am I correct?

TheDoctorAI commented 5 years ago

I am now at "Training for: 13144 time steps" and my Titan GPU is still idling.