pat-coady / trpo

Trust Region Policy Optimization with TensorFlow and OpenAI Gym
https://learningai.io/projects/2017/07/28/ai-gym-workout.html
MIT License
361 stars 106 forks

Nice code! But much nicer if parallelized #2

Closed garymcintire closed 6 years ago

garymcintire commented 6 years ago

Pat,

Love your code and algo here. But I'd really like to see it running in parallel.

You might want to take Schulman's trpo_mpi code from OpenAI Baselines and put your algo in it. I'm trying that with my own algo and it's working out well. But your algo is better.

I'm looking forward to using it parallelized

pat-coady commented 6 years ago

Hey Gary,

Thanks a lot. I'm on the fence about making it parallel since it is mostly an engineering exercise. It seems more important to develop algorithms that can learn robust policies in the fewest number of episodes, since ultimately you'd like to train real robots, and episodes are expensive. Even with a Google robot training farm, the fewer episodes, the better.

All that said, I suspect I will do a parallel implementation soon. I just added a separate NN to my policy that learns variances and even covariances conditioned on actions and observations. Final simulations are running now. There are another couple of policy networks in the queue to try.
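Roughly, the idea is something like the sketch below (placeholder layer sizes and names, not the final network, and conditioned only on observations here for simplicity): a small head that maps observations to per-dimension log-variances for the diagonal Gaussian policy.

```python
import tensorflow as tf

def variance_head(obs_ph, act_dim, hid_size=64):
    """Hypothetical head: map observations to per-action log-variances."""
    h = tf.layers.dense(obs_ph, hid_size, activation=tf.tanh, name='logvar_h1')
    log_vars = tf.layers.dense(h, act_dim, activation=None, name='logvar_out')
    return log_vars  # std = exp(log_vars / 2) for the diagonal Gaussian
```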

I'll keep this open for now.

Pat

garymcintire commented 6 years ago

Pat,

Great to hear about your new algos. I look forward to them.

But, on parallelization... I agree with you 100% that sample efficiency is the most important thing for real-world use. But for developing those algos, the faster the better. If one is working with batch-amenable algos, making them parallel is the best way to speed up development time.

Note InvertedDoublePendulum-v1 (https://gym.openai.com/envs/InvertedDoublePendulum-v1). I have the most sample-efficient algo there, solving in 140 episodes with a DDPG variant, but it took so long to develop because it is not a batch algo and not very amenable to parallelization. It took many hours to run on that simple problem. It became really difficult to extend it to harder problems, so I am very interested in fast development time.

Check out OpenAI Baselines trpo_mpi. I get a 9x speedup running 12 processes, which fills my GPU's 4GB memory. With 8GB of memory I could run 24 processes and get maybe a 15x speedup. That code can easily be extended to multiple GPU cards and even multiple computers because he uses MPI (Message Passing Interface). The code is fairly simple and it'd be pretty easy to fit your algorithm into that shell.

Looking forward to your parallel implementation.

Gary

garymcintire commented 6 years ago

Pat,

I decided to make a parallel version of your code. You can find it here
https://github.com/garymcintire/mpi_util

I just accumulated the batches to process 0, let proc 0 do the weight updates, then broadcast the weights to all processes.
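In pseudocode-ish form, the pattern is roughly this (a simplified sketch, not the actual mpi_util code; collect_trajectories, flatten, and the get/set_weights helpers are stand-ins):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def train_iteration(policy, collect_trajectories, flatten):
    # every process rolls out its own batch of episodes
    local_batch = collect_trajectories(policy)

    # gather all batches on process 0 (pickle-based gather of Python objects)
    all_batches = comm.gather(local_batch, root=0)

    if rank == 0:
        # only process 0 updates the policy/value networks
        policy.update(flatten(all_batches))
        weights = policy.get_weights()
    else:
        weights = None

    # broadcast the updated weights to every process
    weights = comm.bcast(weights, root=0)
    policy.set_weights(weights)
```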

I experimented with just averaging the value function and policy weights of each process after each policy update, but that was slower in both wall-clock time and sample efficiency. See the code in the if statement of policy.py.
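The averaging variant was basically this pattern (sketch only; the real code is behind the if statement mentioned above):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def average_across_procs(weights):
    """Element-wise average of a list of numpy weight arrays over all MPI processes."""
    averaged = []
    for w in weights:
        buf = np.zeros_like(w)
        comm.Allreduce(w, buf, op=MPI.SUM)      # sum each array across processes
        averaged.append(buf / comm.Get_size())  # divide by number of processes
    return averaged
```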

Because the networks were so large on Humanoid-v1 (even a single process eventually ran out of GPU memory on my 4GB 1070), I added arguments to let the user specify the hidden layer lists for the 3-layer networks. It defaults to your original sizing.

I need someone (other than me) to clone it and try it. If you care to, let me know if it worked.

Gary

pat-coady commented 6 years ago

Hey Gary, I'm sorry for the slow reply, I'm very excited to test it out. I've just wrapped up my coursework and am buried with job interviews right now. But I'll try to sneak in a couple hours to look at this.

What sort of machine have you tested this on? What sort of wall clock time improvement are you seeing? After a quick check on my laptop, I'll get on AWS and fire up a 30 core machine and see how it does.

On the Humanoid, I actually think many of the observation dimensions are meaningless. If you look at the roboschool implementation of Humanoid, I think it only has 45 observation dimensions (vs. the MuJoCo implementation). So I'm glad you added provisions to override layer sizes.

Pat

garymcintire commented 6 years ago

Pat,

I have found an issue with MPI interacting with either TensorFlow or Python. In my code, mpi_util.rank0_bcast_wts, the broadcast seems to take extra time with each iteration, such that after a few hours it takes more time than gathering the episode data: many seconds, even minutes.

I'd be quite interested in what you find on a 30-core machine. If that's with a GPU, I doubt you will get much speedup beyond 6 processes, as the GPU card will be the limiting factor. My 1070 seems to max out at about 6 processes. It is easy to put processes on a second GPU card or even multiple machines with MPI.

Here's a table from running the following command, changing the value of nprocs each time (note that it uses small networks with 20 units in each hidden layer):

python ../train.py Walker2d-v1 --nprocs 8 --gpu_pct 0.05 --policy_hid_list [20,20,20] --valfunc_hid_list [20,20,20]

nprocs  steps_per_sec  works?
1       538            good
2       928            good
3       1214           good
4       1448           good
5       1500           fail
6       1800           fail
7       1750           fail
8       1850           fail
12      1600           fail

'fail' means the MPI bcast is slowing with each iteration and will take seconds after a few hours. When it is working (good), the bcast time stays small, usually about 5 milliseconds for this small net.

If I let it default to the larger net that your code creates, it fails with nprocs=2.

I'd be quite interested to get another data point from you, but I really can't release this unless I find a fix for the bcast problem.

If there are any MPI experts reading this, please give me some ideas.

Gary

garymcintire commented 6 years ago

Pat,

OK, found my issue: a bug in my cache code. Now it works great.

python train.py Walker2d-v1 --nprocs 10 --gpu_pct 0.05  -n 2000

nprocs  steps_per_sec  reward_mean
1       546            641
2       1040           544
3       1712           611
4       2014           1220
5       1950           1513
6       2100           860
7       2300           681
8       2400           452
9       2484           359
10      2339           367
11      2384           410
12      2169           336

(reward_mean is highly variable because the robot is highly variable)

So basically you can get about a 4x wall-clock speedup on this problem with my computer and GPU. Other problems will be better or worse.
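For reference, --gpu_pct just caps each process's share of GPU memory so several MPI workers can share one card; in TF 1.x that looks roughly like this (make_session is an illustrative helper, not necessarily the exact code):

```python
import tensorflow as tf

def make_session(gpu_pct=0.05):
    """Create a session that uses at most gpu_pct of the GPU's memory."""
    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_pct)
    return tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
```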

Let me know when you get to try it out. https://github.com/garymcintire/mpi_util

Gary

pat-coady commented 6 years ago

Gary, again, sorry for the delay. I've finally wrapped up my studies and have been interviewing the last couple weeks. I think my last day of on-site interviews will be next week and then I'll attack my neglected TO-DO list.

hardmaru commented 6 years ago

Hi Pat, Gary,

Thanks for your PPO and batch PPO implementations.

I have been able to use Pat's PPO implementation to train on OpenAI's roboschool environments (which are open source and don't require MuJoCo). They are also tougher to train than the original environments. I have also made it work for pybullet's environments (such as the racecar and minitaur).

I made my changes available in a fork (happy to submit a PR and merge back at some point):

https://github.com/hardmaru/trpo/tree/master/src

I made a few changes:

I have also played with Gary's MPI-based batch version of this PPO implementation, and found that while the easier tasks such as the roboschool ant can be trained, the tougher roboschool tasks do not train in the MPI batch PPO. I suspect there's some issue with the weight averaging (or perhaps the log_std_var term) that is not being taken care of; I will need to investigate more. All my tests have been conducted on 64-core VMs on Google Compute Engine.

David

garymcintire commented 6 years ago

David,

Great that you found some use in my MPI code. I'm a little surprised it does not work with the harder roboschool problems. It does not average weights, but rather collects all the batches, trains the networks, and then broadcasts those weights out. Could it be that some of the changes you made do not get broadcast out? That demo broadcasts only what is in the value and policy graphs. Also, try it with nprocs 1 to see if it works that way.

I also found Pat's code wouldn't work with RoboschoolHumanoid, but I think the problem was that this humanoid has a really high cost for muscle inefficiency, which made the rewards go negative; that is probably why your making the initial std smaller would make it work. RoboschoolFlagrun works great as is.

I'd be real interested in the steps_sec you got on your 64-core machine with maybe nprocs = 1, 5, 10, 25, 64. steps_sec is printed out with each batch.

Gary

hardmaru commented 6 years ago

Hi Gary,

Thanks for your reply. I was able to solve the RoboschoolHumanoid-v1 task with Pat's PPO model (using 1 process and a batch size of 20). I used 4 different sets of parameters (5x and 10x multipliers for the network size, and -1.0 and -0.5 biases for log_std), and all 4 models trained the agent to walk successfully across the track. To reproduce the experiment, the commands are in "runall":

python train.py RoboschoolHumanoid-v1 --num_episodes 800000 -f 10 -n -1.0 > humanoid.log 2>&1 &

python train.py RoboschoolHumanoid-v1 --num_episodes 800000 -f 10 -n -0.5 > humanoid_noisy.log 2>&1 &

python train.py RoboschoolHumanoid-v1 --num_episodes 600000 -f 5 -n -1.0 > humanoid_lite.log 2>&1 &

python train.py RoboschoolHumanoid-v1 --num_episodes 800000 -f 5 -n -0.5 > humanoid_lite_noisy.log 2>&1 &

All of these trained over 2-3 days (after a weekend), after 300k steps or so; I didn't need to wait for training to finish. The weights of the policy network are continuously dumped as .json files during training for record-keeping purposes.

To run a saved policy: python demo.py zoo/humanoid.json will launch the environment in render mode and use the pre-trained weights of the policy network trained with train.py.

I'll do more digging into the MPI code, and look more closely at the broadcasting, to try to figure out why RoboschoolHumanoid-v1 wouldn't train. Let me know if you are able to train this environment using Pat's code with a single CPU process, and whether it works in your MPI extension too if you have a chance to try it out.

Thanks!

David

pat-coady commented 6 years ago

@garymcintire - I finally have some free time to catch up.

Re: RoboschoolHumanoid-v1

I remember running this environment without issue a few weeks ago. I don't think negative rewards should pose a problem. Anyway, re-running now to double-check.

pat-coady commented 6 years ago

@hardmaru

Hey David,

Yes, it would be great to do a PR at some point that incorporates adjusting hyperparameters from the command line. A 10x size for hidden layer 1 is definitely overkill for many environments, especially the MuJoCo Humanoid, which has a 377-dimension observation vector. I think most of those dimensions contain no useful information; many seem to always be zero.

  1. Should all hidden layer sizes be optionally adjustable, or do you think it is sufficient to set hid1 and let the rest scale appropriately?

  2. Do you have an example of how much faster training progressed with increased log_var?

  3. Probably should also make layer 1 of the Q-value network adjustable. Or perhaps just make it a multiple of the policy hid1 layer?

  4. Curious only: why did you choose to save models as .json vs checkpointing with TensorFlow?

pat-coady commented 6 years ago

After the next batch of responses, I'll probably close this issue and open more specific issues for each item we are looking at. For the distributed version, it probably makes sense to track it on Gary's repo, so I would create an issue here that links to an issue over there. Unless anyone objects.

Pat

hardmaru commented 6 years ago

Hi Pat,

For RoboschoolHumanoid-v1, since the input size is only 41, a constant factor of 10x or 5x was sufficient. Making all hidden layer sizes adjustable (like Gary's version) would probably make more sense for a broader set of problems, including the older MuJoCo Humanoid with its many extra dimensions (and the same for the Q-value network). But in general, I still prefer a default number (or a set of numbers) that seems to work for a large number of tasks, rather than too many options, so I like the way you chose a multiplicative factor as a heuristic rather than having the user hardcode a network size.
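To make that heuristic concrete, a sizing rule of that flavour might look like the sketch below (illustrative only; I'm not claiming these are the exact numbers the repo uses):

```python
import numpy as np

def hidden_sizes(obs_dim, act_dim, hid1_mult=10):
    """Heuristic layer sizing driven by a single multiplier (illustrative only)."""
    hid1 = obs_dim * hid1_mult         # first hidden layer scales with observation size
    hid3 = act_dim * 10                # last hidden layer scales with action size
    hid2 = int(np.sqrt(hid1 * hid3))   # geometric mean in between
    return hid1, hid2, hid3
```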

Setting the initial bias of log_var from -1.0 to 0.0 slowed down the training, but I found that increasing it was required for some tasks in Roboschool (such as walker2d) to get PPO to break out of its initial state, where it just stays at the beginning of the track and simply avoids tripping over. It is probably also the case that leaving it at -1.0 would just work for walker2d if we ran the training a few times with other, luckier random seeds ...
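For context, here is a rough TF 1.x sketch of where that bias enters (variable names are mine, not the repo's): the initial log_var sets the starting exploration noise of the Gaussian policy, so a larger (less negative) value means noisier early behaviour.

```python
import tensorflow as tf

def gaussian_sample(means, act_dim, init_log_var=-1.0):
    """Sample actions from a diagonal Gaussian whose log-variance starts at init_log_var."""
    log_vars = tf.get_variable('log_vars', shape=(act_dim,),
                               initializer=tf.constant_initializer(init_log_var))
    noise = tf.random_normal(tf.shape(means))
    return means + tf.exp(log_vars / 2.0) * noise   # std = exp(log_var / 2)
```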

I chose .json files over checkpointing since I am doing other experiments at the moment and prefer not to rely on a single framework or library; this lets me transfer the trained models to other experiments. Also, I've had issues before with checking checkpointed binary files into GitHub repos.
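The dump itself is nothing fancy; roughly this kind of helper (illustrative, not the exact code in my fork):

```python
import json
import tensorflow as tf

def dump_weights(sess, filename):
    """Write all trainable variables to a plain .json file, keyed by variable name."""
    weights = {var.name: sess.run(var).tolist() for var in tf.trainable_variables()}
    with open(filename, 'w') as f:
        json.dump(weights, f)
```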

It makes sense to close this issue and move the discussion of parallel PPO on the RoboschoolHumanoid-v1 task to Gary's repo.

Thanks,

David

pat-coady commented 6 years ago

OK, going to close this. I have opened an issue to add adjustable policy variance and hidden layer 1 size:

https://github.com/pat-coady/trpo/issues/6

Also, I only ran it twice, but found RoboschoolHumanoid trained both times with both settings, reaching a mean reward of 1500 or so after 175k episodes. I did see one case where it was stuck on a plateau of mean reward ~200 after 200k episodes, but it seemed to be slowly improving. Not sure if it would have eventually broken free.

pat-coady commented 6 years ago

Going to move discussion of Gary's distributed training implementation to his repo:

https://github.com/garymcintire/mpi_util

Setting up an AWS instance to test with multiple CPUs.

hardmaru commented 6 years ago

Just for your info as a data point: I used your default settings to train the roboschool humanoid and got a score of ~3500 after letting it run for 800k episodes (this was on a GCE machine running an Ubuntu instance).

atypic commented 6 years ago

@pat-coady @hardmaru do you happen to have a training curve for RoboschoolHumanoid-v1? I'm on episode 1e5 with a score of about 130...

hardmaru commented 6 years ago

Hi @atypic

I don't have the training curves anymore, but I've used my fork of @pat-coady's model and got it to work for RoboschoolHumanoid-v1. It should start to walk like a drunk person after 100k episodes or so.

atypic commented 6 years ago

@hardmaru thanks buddy, i'll clone your repo and give it a spin.

hardmaru commented 6 years ago

@atypic Here are the bash commands I used. At the bottom are 4 RoboschoolHumanoid-v1 experiments with different settings, and despite the number of episodes being set to 800k or 600k, all worked (to some extent) after 100-200k episodes.

atypic commented 6 years ago

@hardmaru brilliant, thank you. I'll report my success later on :)

pat-coady commented 6 years ago

Sorry, I didn't keep any training curves. I did train on this environment and recall the Humanoid walked decently, but it always seemed to veer off to the left. Could be there wasn't much reward for going straight down the track?

atypic commented 6 years ago

@pat-coady I managed to get a decent walker with your code as well, so it works for sure.