pat-coady / trpo

Trust Region Policy Optimization with TensorFlow and OpenAI Gym
https://learningai.io/projects/2017/07/28/ai-gym-workout.html
MIT License

KL, PolicyEntropy, PolicyLoss go to NaN after 31,455 episodes #21

Closed: David-Clement-Senbionic closed this issue 6 years ago

David-Clement-Senbionic commented 6 years ago

Hi there, I have created a variant of the HumanoidStandup-v2 environment in gym that uses a much simpler simulated robot, defined in a MuJoCo-formatted XML file. I have tested this model in both MuJoCo and gym, and it seems to work fine. I first ran the stock HumanoidStandup-v2 training on my hardware/software configuration, and it worked well up to 50,000 episodes. I then ran the identical setup with our robot model, using the same reward function as the standard HumanoidStandup-v2. The only substantive difference between the two runs is the MuJoCo model. When I ran the training on our model, I got:

Episode 31455, Mean R = 28911.0 Beta: 6.91 ExplainedVarNew: 0.913 ExplainedVarOld: 0.812 KL: nan PolicyEntropy: nan PolicyLoss: nan Steps: 672 ValFuncLoss: 114

Traceback (most recent call last):
  File "./train.py", line 334, in <module>
    main(**vars(args))
  File "./train.py", line 290, in main
    trajectories = run_policy(env, policy, scaler, logger, episodes=batch_size)
  File "./train.py", line 135, in run_policy
    observes, actions, rewards, unscaled_obs = run_episode(env, policy, scaler)
  File "./train.py", line 105, in run_episode
    obs, reward, done, _ = env.step(np.squeeze(action, axis=0))
  File "/home/david/source/gym/gym/wrappers/monitor.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/david/source/gym/gym/wrappers/time_limit.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/david/source/gym/gym/envs/Senbionic/ballbotEnv.py", line 28, in step
    self.do_simulation(a, self.frame_skip)
  File "/home/david/source/gym/gym/envs/mujoco/mujoco_env.py", line 100, in do_simulation
    self.sim.step()
  File "source/mujoco-py/mujoco_py/mjsim.pyx", line 119, in mujoco_py.cymj.MjSim.step
  File "source/mujoco-py/mujoco_py/cymj.pyx", line 115, in mujoco_py.cymj.wrap_mujoco_warning.__exit__
  File "source/mujoco-py/mujoco_py/cymj.pyx", line 75, in mujoco_py.cymj.c_warning_callback
  File "/home/david/.conda/envs/gym35/lib/python3.5/site-packages/mujoco_py-1.50.1.53-py3.5.egg/mujoco_py/builder.py", line 319, in user_warning_raise_exception
    raise MujocoException('Got MuJoCo Warning: {}'.format(warn))
mujoco_py.builder.MujocoException: Got MuJoCo Warning: Unknown warning type Time = 0.0000.

I ran it again and it did the same thing at Episode 1280.

Any suggestions on how to approach overcoming this?

Many thanks for any advice.

pat-coady commented 6 years ago

David,

Without looking at this in more detail, my first suggestion would be to reduce the policy learning rate by 10x and see whether that helps.

Sorry I haven't been able to look more carefully.

Pat
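
For anyone wanting to try the 10x reduction without editing code for every run, one option is to expose a scale factor on the command line. The sketch below is only an illustration under stated assumptions, not code from this repository: the `--policy_lr_scale` flag, the `scaled_policy_lr` helper, and the base rate of 1e-3 are all hypothetical. The traceback shows train.py calling main(**vars(args)), so a new option like this would also need a matching keyword argument in main.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical learning-rate scale option."""
    parser = argparse.ArgumentParser(
        description="Sketch: scale the policy learning rate from the CLI")
    # Hypothetical flag; it is not part of the original train.py.
    parser.add_argument("--policy_lr_scale", type=float, default=1.0,
                        help="Multiplier applied to the policy learning rate")
    return parser


def scaled_policy_lr(base_lr, scale):
    """Return base_lr * scale, rejecting non-positive inputs."""
    if base_lr <= 0 or scale <= 0:
        raise ValueError("base_lr and scale must both be positive")
    return base_lr * scale


# Example: a 10x reduction corresponds to a scale of 0.1.
args = build_parser().parse_args(["--policy_lr_scale", "0.1"])
base_lr = 1e-3  # placeholder value, substitute the rate your policy uses
lr = scaled_policy_lr(base_lr, args.policy_lr_scale)
print("policy learning rate: {:.2e}".format(lr))
```

Keeping the default at 1.0 leaves existing runs unchanged, so the reduced rate can be compared against the baseline with the same script.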


David-Clement-Senbionic commented 6 years ago

The problem seemed to go away after I reconfigured the MuJoCo model parameters. I believe it was just MuJoCo hitting an "exploding" result, which then caused a cascade effect.
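
A cascade like this is easier to diagnose if the rollout loop stops at the first non-finite value instead of passing it on to the simulator. The following is a minimal sketch under assumptions, not the code in train.py: `assert_all_finite`, `guarded_rollout`, and the toy policy and step functions are hypothetical names invented for illustration, and the step function returns three values rather than gym's four to keep the example short.

```python
import math


class NonFiniteValueError(RuntimeError):
    """Raised when a rollout value contains NaN or infinity."""


def assert_all_finite(name, values, step):
    """Raise if any entry of a flat sequence of numbers is NaN or inf."""
    for index, value in enumerate(values):
        if not math.isfinite(float(value)):
            raise NonFiniteValueError(
                "{} has non-finite entry {!r} at index {} (step {})".format(
                    name, value, index, step))


def guarded_rollout(policy_fn, step_fn, first_obs, max_steps):
    """Run one episode, checking actions and observations at every step."""
    obs = first_obs
    assert_all_finite("observation", obs, 0)
    total_reward = 0.0
    for step in range(max_steps):
        action = policy_fn(obs)
        # Check the action before it reaches the simulator.
        assert_all_finite("action", action, step)
        obs, reward, done = step_fn(action)
        assert_all_finite("observation", obs, step + 1)
        total_reward += reward
        if done:
            break
    return total_reward


# Toy stand-ins: the policy emits NaN on its third call.
calls = {"n": 0}


def toy_policy(obs):
    calls["n"] += 1
    return [float("nan")] if calls["n"] == 3 else [obs[0] + 1.0]


def toy_step(action):
    return [action[0] * 0.5], 1.0, False


try:
    guarded_rollout(toy_policy, toy_step, [0.0], max_steps=10)
    failure = None
except NonFiniteValueError as err:
    failure = str(err)
print(failure)
```

Because the error message records which value went bad and at which step, it helps separate a diverging policy (bad actions first) from an unstable model (bad observations first).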

pat-coady commented 6 years ago

Were you able to get your humanoid to stand up? If so, would love to see a video.


David-Clement-Senbionic commented 6 years ago

Hi Patrick, I only ran 50,000 episodes but it seemed to be working well.

https://youtu.be/KEHqKpSNuJ0

Cool stuff 😎

David


ghost commented 5 years ago

It seemed to go away by reconfiguring the mujoco model parameters. I believe it was just mujoco hitting an "exploding" result causing a cascade effect.

@David-Clement-Senbionic Hi David, I have the same problem: I created a new MuJoCo humanoid model with human-like parameters, but at some point my simulation explodes like yours did. How did you reconfigure your MuJoCo model parameters? Also, if possible, could you share your working code? Did you change your reward function for standing up?

ghost commented 5 years ago

@David-Clement-Senbionic My mistake, the code is already shared :D but MuJoCo optimization is still an issue for me. Any help would be appreciated :)