jr-robotics / robo-gym

An open source toolkit for Distributed Deep Reinforcement Learning on real and simulated robots.
https://sites.google.com/view/robo-gym
MIT License
390 stars 74 forks

Training Overnight #25

akeaveny closed 3 years ago

akeaveny commented 3 years ago

Hi @matteolucchi,

I need your help again!

My desktop has limited resources so I train overnight. My latest issue is that the robot server cannot communicate with the client after ~4 hours.

Here's the error message:

Traceback (most recent call last):
  File "train_TD3.py", line 311, in <module>
    main(args)
  File "train_TD3.py", line 274, in main
    model.learn(total_timesteps=int(config.TOTAL_TRAINING_ENV_STEPS), log_interval=10, callback=td3_callbacks)
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/td3/td3.py", line 196, in learn
    return super(TD3, self).learn(
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 251, in learn
    rollout = self.collect_rollouts(
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/off_policy_algorithm.py", line 418, in collect_rollouts
    if callback.on_step() is False:
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 192, in _on_step
    continue_training = callback.on_step() and continue_training
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 88, in on_step
    return self._on_step()
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/callbacks.py", line 335, in _on_step
    episode_rewards, episode_lengths = evaluate_policy(
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/evaluation.py", line 46, in evaluate_policy
    obs = env.reset()
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/vec_env/vec_normalize.py", line 214, in reset
    obs = self.venv.reset()
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/vec_env/dummy_vec_env.py", line 61, in reset
    obs = self.envs[env_idx].reset()
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/stable_baselines3/common/monitor.py", line 86, in reset
    return self.env.reset(**kwargs)
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/gym/core.py", line 264, in reset
    observation = self.env.reset(**kwargs)
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/gym/core.py", line 289, in reset
    return self.env.reset(**kwargs)
  File "/home/akeaveny/anaconda3/envs/ROBO_GYM/lib/python3.8/site-packages/gym/wrappers/time_limit.py", line 25, in reset
    return self.env.reset(**kwargs)
  File "/home/akeaveny/git/robo-gym/robo_gym/envs/UWRTArm/UWRTArm.py", line 362, in reset
    rs_state = copy.deepcopy(np.nan_to_num(np.array(self.client.get_state_msg().state)))
AttributeError: 'UWRTArmSim' object has no attribute 'client'

Cheers, Aidan

matteolucchi commented 3 years ago

Hi!

Are you running robo-gym and the robot-server on the same pc or on 2 separate machines?

Could it be related to the fact that after 4 hours the pc goes into sleep mode or something like that?

akeaveny commented 3 years ago

Same pc!

I don't believe my pc goes to sleep, as I've run other processes overnight. I also disabled sleep mode earlier this week using sudo systemctl mask sleep.target suspend.target hibernate.target hybrid-sleep.target.

Aidan

matteolucchi commented 3 years ago

Do you have access to the output log from the Server Manager to see if some more info is given in there?

akeaveny commented 3 years ago

Hmm, I checked /home/akeaveny/robogym_ws/logs/uwrt_robot_server but couldn't find anything useful.

Was this what you had in mind?

matteolucchi commented 3 years ago

Mmm no. Some time ago we introduced a logger, and you should have the logs under '/home/akeaveny/robogym_ws/src/robo-gym-robot-servers/logs/', at least for the UR robot server. It is initialised here: https://github.com/jr-robotics/robo-gym-robot-servers/blob/35802004460600f2ea2d3f7d1b5205969c7f65a9/ur_robot_server/scripts/robot_server.py#L62 . I don't know if you have the same for your robot server, but adding it will definitely help you see more information.
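For a robot server that has no logger yet, a minimal sketch of file logging with Python's standard logging module could look like the following. This is not the code from the linked robot_server.py; the function name make_file_logger, the log directory, and the rotation sizes are all assumptions chosen for illustration.

```python
import logging
import os
from logging.handlers import RotatingFileHandler


def make_file_logger(name, log_dir, max_bytes=1_000_000, backups=3):
    """Return a logger that writes timestamped records to <log_dir>/<name>.log.

    Hypothetical helper: the name, directory and sizes are illustrative only.
    """
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False  # keep records out of the root logger
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            os.path.join(log_dir, name + ".log"),
            maxBytes=max_bytes,   # rotate so an overnight run cannot fill the disk
            backupCount=backups,
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Calling logger = make_file_logger("uwrt_robot_server", "logs") once at startup and then logger.info(...) around each reset and state request would leave a timestamped trail, so the last message before the ~4 hour mark shows what the server was doing.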

Now, going back to our issue: we regularly train for long periods (up to 48h) and have never had that specific issue.

Once we had an issue with the network card of a pc going to 'sleep', and we just had to leave a terminal open with a ping to google.com to solve that.
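The same keep-alive idea can be sketched in Python if a script is more convenient than an open terminal. The function name keep_alive and its parameters are hypothetical, and the '-c 1' flag assumes a Linux or macOS ping.

```python
import subprocess
import time


def keep_alive(host, interval_s=30.0, count=None,
               run=subprocess.run, sleep=time.sleep):
    """Ping `host` every `interval_s` seconds; return the number of failed pings.

    count=None loops until interrupted. `run` and `sleep` can be swapped
    out, which makes the loop testable without touching the network.
    """
    failures = 0
    sent = 0
    while count is None or sent < count:
        # '-c 1' sends a single echo request (Linux/macOS ping syntax).
        result = run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
        sent += 1
        if count is None or sent < count:
            sleep(interval_s)  # wait before the next ping
    return failures
```

Running keep_alive("google.com") in a separate process for the duration of training mirrors the terminal workaround. The failure count is only a rough hint that the network interface dropped out at some point.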

The weird thing here

  File "/home/akeaveny/git/robo-gym/robo_gym/envs/UWRTArm/UWRTArm.py", line 362, in reset
    rs_state = copy.deepcopy(np.nan_to_num(np.array(self.client.get_state_msg().state)))
AttributeError: 'UWRTArmSim' object has no attribute 'client'

is that it cannot find the client attribute, which is just an attribute of the object and should have been there the whole time. Most of the time, if something goes wrong with the connection, we see gRPC errors instead.
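In Python, this kind of AttributeError normally means the attribute was never set on that particular instance, or was removed later. A dropped connection would not produce it. One pattern that leads to it is a constructor that only creates the client when an address is passed in. Whether UWRTArm.py does this is an assumption; the classes below are hypothetical stand-ins, not the real environment. The traceback also shows the failing reset being reached through evaluate_policy, so it may be worth checking how the environment handed to the evaluation callback is constructed.

```python
class FakeClient:
    """Hypothetical stand-in for the robot-server client."""

    def get_state(self):
        return [0.0, 0.0, 0.0]


class ArmEnv:
    """Hypothetical environment, not the real UWRTArm class."""

    def __init__(self, rs_address=None):
        # 'client' only exists when an address was given; otherwise the
        # attribute is never assigned on this instance.
        if rs_address:
            self.client = FakeClient()

    def reset(self):
        # Fail with an explicit message instead of a bare AttributeError.
        client = getattr(self, "client", None)
        if client is None:
            raise RuntimeError(
                "no robot-server client: was rs_address passed to the constructor?"
            )
        return client.get_state()
```

With a guard like this, an instance built without an address fails on its first reset with a message that points at the constructor call, rather than at the line that uses the client.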

Have you had the same error in every overnight training run?

akeaveny commented 3 years ago

Thanks for this, I added this block to our robot_server.py :)

Yeah, I've had the same error for two consecutive nights... Similar to your envs, I init UWRTArmSim, which wraps our UWRTArmEnv here, then connect to the robot server here.

What's strange is that it gives this error ~4hrs into training each time, so my first guess was that my pc was sleeping at this point.

matteolucchi commented 3 years ago

Yes, it is very strange indeed. Have you ever trained overnight with the same algorithm on other environments, for instance from the OpenAI Gym? This could help to understand if the error is related to robo-gym or if it is something related to the pc settings.

akeaveny commented 3 years ago

I'm going to close this as I don't think it's related to robo-gym. I verified that it isn't a connection issue as I ran it during the day, yesterday. It's strange because we ran the same script with OpenAI env & PyBullet here.

Cheers!

matteolucchi commented 3 years ago

Ok, I am sorry to hear that, I hope you can manage to fix the issue soon!