google-research / episodic-curiosity

Tensorflow/Keras code and trained models for Episodic Curiosity Through Reachability
Apache License 2.0

Too slow training on GPU #2

Open jaekyeom opened 5 years ago

jaekyeom commented 5 years ago

Hello,

I'm trying to train the model on my local machine, which has Titan Xp GPUs and a fairly large number of CPU cores (48).

The issue is that training is too slow. The fps values reported by the openai/baselines code are below 100, and a simple calculation says it will take about 9 days to complete the full 20M timesteps.

The process takes up most of the GPU memory, but I don't think it's actively utilizing the GPU. Also, despite plenty of free CPU cores, its CPU utilization is really low (on the order of a couple of cores). Its memory usage is around 20 GB.

I tried building DeepMind Lab with both graphics=osmesa_or_egl and graphics=osmesa_or_glx (I used xvfb-run to execute the glx version), but there was not much difference.

I even checked that the C++ function Lab_init() got the renderer='hardware' argument.

Another weird thing is that, according to htop, map generation seems to take a long time to finish. Update: I measured the time spent in each deepmind_lab/deepmind/level_generation/compile_map.sh call; it's 9-13 seconds.

The command I used is python scripts/launcher_script.py --workdir=experiments --method=ppo_plus_eco --scenario=sparseplusdoors.

Is this normal behavior?

jaekyeom commented 5 years ago

I think I've figured out the main cause: #3

But I'm still curious if deepmind_lab/deepmind/level_generation/compile_map.sh calls are supposed to take that much time.

RaphaelMarinier commented 5 years ago

Regarding taking up all of the GPU memory: this is TensorFlow's default behavior. If you want to change it, see the TensorFlow documentation on GPU memory options.
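
For example, here is a minimal sketch of enabling on-demand GPU memory allocation in TF1-style code (where exactly to plug this in depends on the codebase; the session construction below is only illustrative):

```python
import tensorflow as tf

# By default TF pre-allocates (nearly) all GPU memory.
# allow_growth makes it allocate memory on demand instead.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

# Alternatively, cap the fraction of GPU memory the process may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.5

sess = tf.Session(config=config)
```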

We also see on our end that map compilation takes significant time. Note that DMLab generates a new random maze at the beginning of each episode, which explains why we need to recompile the map frequently.

9 days of training is in line with what we see on our side. Note that 10M steps are often enough to iterate on the algorithm, which takes about 4.5 days.

We haven't spent a lot of time making the code more efficient. Contributions in this area are welcome (thanks for the PR, BTW). For instance, one might want to run more environments in parallel (flag --num_env).
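
For example, assuming the launcher accepts the flag, you could append it to the command above (the value 32 is purely illustrative; pick what fits your CPU and memory budget): python scripts/launcher_script.py --workdir=experiments --method=ppo_plus_eco --scenario=sparseplusdoors --num_env=32.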

nsavinov commented 5 years ago

One way to speed up generation (at least when training is run multiple times) is to cache the generated maps and save them on disk. We didn't experiment with this, but @alex-petrenko did. Alex, did you get an improvement?

alex-petrenko commented 5 years ago

@nsavinov @jaekyeom Indeed there is a way to harness the existing level cache functionality of DMLab. Keep in mind that this does not speed up the initial level generation, but it allows you to re-use levels created in the past. E.g. if you start multiple training runs with a single environment seed, the IDs of the random mazes generated by DMLab will be the same and you will be able to take advantage of the cache. Otherwise, the cache does not help.

In this gist, you can find my DMLab gym wrapper with cache. It relies on some external code but should be easy to merge into any codebase: https://gist.github.com/alex-petrenko/c4fe6201749d9bb0cef5e642ce9a511a
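
For illustration, here is a stripped-down sketch of the same idea. It assumes DMLab's level_cache hook, i.e. an object with fetch(key, pk3_path) and write(key, pk3_path) methods; the class name and cache directory are placeholders:

```python
import os
import shutil


class DmlabLevelCache(object):
    """Minimal disk-backed level cache sketch for DMLab.

    DMLab calls fetch() before compiling a map and write() after
    compiling one, so a map with the same key is only compiled once.
    """

    def __init__(self, cache_dir):
        self._cache_dir = cache_dir
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)

    def _cached_path(self, key):
        return os.path.join(self._cache_dir, key)

    def fetch(self, key, pk3_path):
        path = self._cached_path(key)
        if os.path.isfile(path):
            # Cache hit: copy the previously compiled map into place.
            shutil.copyfile(path, pk3_path)
            return True
        return False  # Cache miss: DMLab compiles the map as usual.

    def write(self, key, pk3_path):
        path = self._cached_path(key)
        if not os.path.isfile(path):
            shutil.copyfile(pk3_path, path)


# Illustrative usage (constructor args depend on your setup):
# env = deepmind_lab.Lab(level, observations, config=config,
#                        level_cache=DmlabLevelCache('/tmp/dmlab_level_cache'))
```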

I use another script to generate large quantities of levels before I do any actual training. I cannot share it now, but in a nutshell, the idea is to just create levels in a loop with the same seeds that you will later use for training. I'm planning to release all of this after I finish my current research project.
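
To give a rough idea, such a pre-generation loop can be as simple as the following. This is a sketch only: the level name, observation spec, config and seed range are placeholders, and it assumes the level cache sketched above plus DMLab's reset(seed=...) argument:

```python
import deepmind_lab

# Placeholders: match these to your actual training configuration.
LEVEL = 'contributed/dmlab30/explore_goal_locations_small'
SEEDS = range(1, 1001)

# DmlabLevelCache is the disk-backed cache sketched above.
cache = DmlabLevelCache('/tmp/dmlab_level_cache')
env = deepmind_lab.Lab(LEVEL, ['RGB_INTERLEAVED'],
                       config={'width': '96', 'height': '72'},
                       level_cache=cache)

for seed in SEEDS:
    # Each reset with a fresh seed triggers maze generation and map
    # compilation; the compiled map ends up in the cache via write().
    env.reset(seed=seed)
```

Then, as long as training uses the same seeds, resets hit the cache instead of recompiling the maps.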

Please let me know if you have any other questions.

nsavinov commented 5 years ago

@alex-petrenko Great, thanks! By the way, what speed-up factor did you get by doing this? I.e., how much faster was training after you pre-generated all the maps vs. generating them on the fly?

alex-petrenko commented 5 years ago

@nsavinov It's hard to tell because it depends on the particular experimental setup. In the specific case of your experiments for the ECR paper, the benefit isn't that great: first, you use a very long time horizon, and second, all episodes have exactly the same duration, so episode resets happen simultaneously for all envs and in parallel. So I think in your case it's only a few percent improvement.

In my case, I have an A2C-style implementation of PPO, where we wait for the env step to complete in all parallel environments before generating the next actions. On top of that, I used scenarios that end when the agent collects a goal, so reset() calls happen at random times, and all environments have to wait for that single reset(). In this setup, I saw more than a 50% speed increase from caching.
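
If it helps to picture the bottleneck, here is a toy, self-contained sketch of that synchronous stepping pattern (the env is a dummy, and the sleep stands in for map compilation, scaled down from the real 9-13 s):

```python
import random
import time


class DummyEnv(object):
    """Stand-in for a DMLab env whose reset() is slow (map compilation)."""

    def reset(self):
        time.sleep(random.uniform(0.9, 1.3))  # scaled-down compile_map.sh cost
        return 'obs'

    def step(self, action):
        done = random.random() < 0.01  # episodes end at random times
        return 'obs', 0.0, done


# Synchronous (A2C-style) stepping: the next batch of actions cannot be
# computed until every env has finished its step, including any env that
# happens to be resetting, so a single slow reset stalls the whole batch.
envs = [DummyEnv() for _ in range(8)]
observations = [env.reset() for env in envs]
for _ in range(100):
    actions = [0] * len(envs)  # placeholder for policy outputs
    for i, (env, action) in enumerate(zip(envs, actions)):
        obs, reward, done = env.step(action)
        if done:
            obs = env.reset()  # all other envs wait for this reset
        observations[i] = obs
```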

nsavinov commented 5 years ago

Thanks for the info, Alex!