Open jaekyeom opened 5 years ago
I think I've figured out the main cause: #3
But I'm still curious whether `deepmind_lab/deepmind/level_generation/compile_map.sh` calls are supposed to take that much time.
Regarding taking up all of the GPU memory: this is TensorFlow's default behavior. If you want to change it, see here.
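For reference, with the TF1-era API that openai/baselines uses, the default allocate-everything behavior can be changed with a session config roughly like this (a configuration sketch; adjust to your TensorFlow version):

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of grabbing
# (nearly) all of it up front, which is the default.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TF may use:
# config.gpu_options.per_process_gpu_memory_fraction = 0.4
sess = tf.Session(config=config)
```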
We also see on our end that map compilation takes significant time. Note that DMLab generates a new random maze at the beginning of each episode, which explains why we need to recompile the map frequently.
9 days of training is in line with what we see on our side. Note that 10M steps (about 4.5 days) are often enough to iterate on the algorithm.
We haven't spent a lot of time making the code more efficient, and contributions in this area are welcome (thanks for the PR, BTW). For instance, one might want to run more environments in parallel (flag `--num_env`).
One way to speed up generation (at least when training is run multiple times) is by caching those generated maps and saving them on disk. We didn't experiment with this but @alex-petrenko did. Alex, did you get an improvement?
@nsavinov @jaekyeom Indeed there is a way to harness the existing level-cache functionality of DMLab. Keep in mind that this does not speed up the initial level generation; it only lets you re-use levels created in the past. E.g., if you start multiple training runs with the same environment seed, the IDs of the random mazes generated by DMLab will be the same and you can take advantage of the cache. Otherwise, the cache does not help.
In this gist, you can find my DMLab gym wrapper with cache. It relies on some external code but should be easy to merge into any codebase: https://gist.github.com/alex-petrenko/c4fe6201749d9bb0cef5e642ce9a511a
I use another script to generate large quantities of levels before I do any actual training. I cannot share it now, but in a nutshell, the idea is to just create levels in a loop with the same seeds that you will later use for training. I'm planning to release all of this after I finish my current research project.
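The pre-generation loop Alex describes can be sketched as follows (my own sketch, not his script; `make_env` is a hypothetical factory that builds an env with the level cache attached, and it relies on DMLab's `reset()` accepting a seed):

```python
def pregenerate_levels(seeds, make_env):
    """Warm the level cache before training by forcing maze generation
    once per seed. Each reset with a fresh seed makes DMLab compile the
    corresponding maze, which the attached level cache then stores."""
    env = make_env()
    for seed in seeds:
        env.reset(seed=seed)
    env.close()
```

Run it once with the exact seeds you plan to train with, so later training runs hit the cache on every reset.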
Please let me know if you have any other questions.
@alex-petrenko Great, thanks! By the way, what speed-up factor did you get from this? I.e., how much faster was training with all maps pre-generated vs. generated on the fly?
@nsavinov It's hard to tell because it depends on the experimental setup. In the particular case of your experiments for the ECR paper, the benefit isn't that great: first, you use a very long time horizon, and second, all episodes have exactly the same duration, so resets happen simultaneously for all envs and run in parallel. So I think in your case it's only a few percent improvement.
In my case, I have an A2C-style implementation of PPO, where we wait for the env step to complete in all parallel environments before generating the next actions. On top of that, I used scenarios that end when the agent collects a goal, so reset() calls happen at random times, and all environments have to wait for that single reset(). In this case, I saw more than a 50% speed increase from caching.
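Alex's point can be sanity-checked with a back-of-the-envelope model: in a synchronous loop, a reset that recompiles a map stalls the whole batch, so the amortized cost per step is roughly `step_time + reset_time / episode_len`. The numbers below are illustrative assumptions, not measurements:

```python
def steps_per_second(step_time, reset_time, episode_len):
    """Amortized per-env throughput of a synchronous stepping loop where
    one full reset (costing reset_time) stalls the batch once every
    episode_len steps."""
    return 1.0 / (step_time + reset_time / episode_len)


# Illustrative numbers: a 10 ms env step, a 10 s map compilation on
# reset, 1000-step episodes.
uncached = steps_per_second(0.010, 10.0, 1000)  # reset recompiles the map
cached = steps_per_second(0.010, 0.0, 1000)     # cached reset ~free
```

With these made-up numbers the cached version roughly doubles throughput, which is in the same ballpark as the >50% figure above.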
Thanks for the info, Alex!
Hello,
I'm trying to train the model on my local machine, with Titan XP GPUs and a fairly large number of CPU cores (48).
But the training is too slow: the FPS values reported by the openai/baselines code are below 100, and a simple calculation says it will take 9 days to complete (i.e., 20M timesteps).
The process takes up most of the GPU memory, but it doesn't seem to utilize the GPU actively. Also, despite plenty of free CPU cores, CPU utilization is really low (only a couple of cores). Memory usage is around 20 GB.
I tried building DeepMind Lab with both `graphics=osmesa_or_egl` and `graphics=osmesa_or_glx` (using `xvfb-run` to run the glx version), but there was not much difference. I even checked that the C++ function `Lab_init()` got the `renderer='hardware'` argument.
Another odd thing: to me (and htop), it looks like map generation takes a long time to finish. Edit: I measured the time spent on each `deepmind_lab/deepmind/level_generation/compile_map.sh` call; it's 9-13 seconds.
The command I used is `python scripts/launcher_script.py --workdir=experiments --method=ppo_plus_eco --scenario=sparseplusdoors`.
Is this normal behavior?