jr-robotics / robo-gym

An open source toolkit for Distributed Deep Reinforcement Learning on real and simulated robots.
https://sites.google.com/view/robo-gym
MIT License

hints for a manipulator arm #33

Closed · damounayman closed this issue 3 years ago

damounayman commented 3 years ago

Hi, I am really interested in your solution. We are trying to build a benchmark of existing solutions for closing the reality gap in sim-to-real transfer for robotics. In this context, we use a UR3e to test the existing solutions. For a quick test, I ran the code below, but I am far from convergence. Can you give me hints, suggest algorithms, or point me to examples for a manipulator arm?

import robo_gym
from robo_gym.wrappers.exception_handling import ExceptionHandling
import gym
#gym.logger.set_level(40)
from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy
from datetime import datetime
# specify the ip of the machine running the robot-server
target_machine_ip = '127.0.0.1'

# initialize environment (to render the environment set gui=True)
env = gym.make('EndEffectorPositioningURSim-v0', ip=target_machine_ip, gui=True)
env = ExceptionHandling(env)
model = DDPG(MlpPolicy, env, verbose=1)
# follow the instructions provided by stable-baselines

num_episodes = 60
start_time = datetime.now()

for episode in range(num_episodes):
    print("runing the episode numbre",episode)
    model.learn(total_timesteps=int(15000))
    #saving and loading a model
    model.save("ddpg_ur3e")
    del model
    print("load the model",episode)
    model = DDPG.load("ddpg_ur3e", env=env, policy=MlpPolicy)
    print('Duration: {}'.format(datetime.now() - start_time))

print('Duration: {}'.format(datetime.now() - start_time))
env.kill_sim()
friedemannzindler commented 3 years ago

Hi, did you already try other algorithms?

I tried the stable-baselines3 implementation of DDPG and can confirm the issue of it not really learning the task. However, I don't have any idea right now why it does not work.

We used D4PG to complete this task. For a reference implementation I recommend: https://github.com/schatty/d4pg-pytorch
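For reference, a minimal sketch of a stable-baselines3 DDPG setup on this environment; the Gaussian action noise and its scale, as well as the timestep budget, are illustrative assumptions rather than the exact configuration I used:

import gym
import numpy as np
import robo_gym
from robo_gym.wrappers.exception_handling import ExceptionHandling
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

# ip of the machine running the robot-server
target_machine_ip = '127.0.0.1'

env = gym.make('EndEffectorPositioningURSim-v0', ip=target_machine_ip, gui=False)
env = ExceptionHandling(env)

# Gaussian exploration noise on the continuous action space
# (sigma=0.1 is an assumed starting point, not a tuned value)
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ddpg_sb3_ur")
env.kill_sim()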

damounayman commented 3 years ago

Thanks for answering. I have implemented DDPG with stable-baselines. With small modifications to my script, the robot starts to avoid collisions and reach the objective, but the average reward remains negative even after 1e6 timesteps of DDPG. Here is the reward graph. I will test other algorithms and the D4PG implementation you mentioned. Thanks again.

import datetime
import os, sys
import numpy as np
import gym
import robo_gym
from robo_gym.wrappers.exception_handling import ExceptionHandling
from stable_baselines import DDPG
from stable_baselines.ddpg.policies import MlpPolicy
from stable_baselines import results_plotter
from stable_baselines.bench import Monitor
from stable_baselines.results_plotter import load_results, ts2xy
from stable_baselines.common.noise import AdaptiveParamNoiseSpec
from stable_baselines.common.callbacks import BaseCallback
import matplotlib.pyplot as plt

#gym.logger.set_level(40)

# specify the ip of the machine running the robot-server
target_machine_ip = '127.0.0.1'
log_dir = os.getcwd()

# Train the agent
time_steps = 2000000
best_mean_reward, n_steps = -np.inf, 0

def callback(_locals, _globals):
  """
  Callback called at each step for DDPG
  :param _locals: (dict)
  :param _globals: (dict)
  """
  global n_steps, best_mean_reward, num_best_model
  # Print stats every 1000 calls
  if (n_steps + 1) % 1000 == 0:
      # Evaluate policy performance
      x, y = ts2xy(load_results(log_dir), 'timesteps')
      if len(x) > 0:
          mean_reward = np.mean(y[-100:])
          print(x[-1], 'timesteps')
          print("Best mean reward: {:.2f} - Last mean reward per episode: {:.2f}".format(best_mean_reward, mean_reward))
          print('Duration: {}'.format(datetime.datetime.now() - start_time))
          # New best model, you could save the agent here
          if mean_reward > best_mean_reward:
              best_mean_reward = mean_reward
              # Example for saving best model
              print("Saving new best model")
              num_best_model+=1
              _locals['self'].save(log_dir + '/best_model_'+str(num_best_model)+'.pkl')
  n_steps += 1
  #plt.show()
  if (n_steps + 1) % 5000 == 0:
      results_plotter.plot_results([log_dir], time_steps, results_plotter.X_TIMESTEPS, "End Effector Positioning UR3e DDPG")
      plt.savefig('End_Effector_Positioning_UR3e_DDPG'+str(num_best_model+5)+'.png')
  return True

# initialize environment (to render the environment set gui=True)
env = gym.make('EndEffectorPositioningURSim-v0', ip=target_machine_ip, gui=False)
env = ExceptionHandling(env)
env = Monitor(env, log_dir, allow_early_resets=True)
model = DDPG.load("best_model_105.pkl", env=env, policy=MlpPolicy)

# use stable-baselines
start_time = datetime.datetime.now()
num_best_model = 105
model.learn(total_timesteps=time_steps, callback=callback)

results_plotter.plot_results([log_dir], time_steps, results_plotter.X_TIMESTEPS, "End Effector Positioning UR3e DDPG")
#plt.show()
plt.savefig('End_Effector_Positioning_UR3e_DDPG.png')
env.kill_sim()

[Reward plot: End_Effector_Positioning_UR3e_DDPG129.png]

damounayman commented 3 years ago

Thank you for sharing the libraries you used. I managed to get faster convergence with other algorithms such as TD3. You can close the issue.
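In case it helps others, here is a minimal sketch of how TD3 can be swapped in with stable-baselines (v2); the action-noise scale and timestep budget are assumptions for illustration, not my exact configuration:

import gym
import numpy as np
import robo_gym
from robo_gym.wrappers.exception_handling import ExceptionHandling
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy
from stable_baselines.common.noise import NormalActionNoise

# ip of the machine running the robot-server
target_machine_ip = '127.0.0.1'

env = gym.make('EndEffectorPositioningURSim-v0', ip=target_machine_ip, gui=False)
env = ExceptionHandling(env)

# Gaussian exploration noise on the continuous action space
# (sigma=0.1 is an assumed starting point, not a tuned value)
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = TD3(MlpPolicy, env, action_noise=action_noise, verbose=1)
model.learn(total_timesteps=1000000)
model.save("td3_ur3e")
env.kill_sim()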

friedemannzindler commented 3 years ago

I'm glad that it worked out with other algorithms. Also, I would be really interested in your results once you are done!