Perf in OpenAI-Gym/HalfCheetah-v2:
/home/jim/anaconda2/envs/clustering/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.24.2) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
{'config': 'config.ini', 'view': False, 'save_best': False, 'load': None, 'render': False, 'capture': False, 'test_only': False}
ENV: <TimeLimit<HalfCheetahEnv<HalfCheetah-v2>>>
State space: Box(17,)
- low: [-inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf -inf
-inf -inf -inf]
- high: [inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf inf]
Action space: Box(6,)
- low: [-1. -1. -1. -1. -1. -1.]
- high: [1. 1. 1. 1. 1. 1.]
Create agent with (nb_motors, nb_sensors) : 6 17
main algo : PeNFAC(lambda)-V
episode 0 total steps 0 last perf 0
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 0 1000 -47.115292 -1.8866550501 -47.1152922702 0.00000 0.20000 0 0.000 43.286 63.462
episode 100 total steps 100000 last perf 0.22751676727166603
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 100 1000 -35.538 -4.6774409211 -35.5381946659 4.48089 0.20000 5000 0.491 405.149 83.786
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 100 1000 -0.049 0.0439331957 -0.0487643295 4.48089 0.20000 5000 0.491 405.149 83.786
episode 200 total steps 200000 last perf 140.46418944639632
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 200 1000 204.978 56.9361475313 204.9779973113 74.57837 0.20000 5000 0.488 654.829 149.948
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 200 1000 48.583 3.4378766450 48.5834364685 74.57837 0.20000 5000 0.488 654.829 149.948
episode 300 total steps 300000 last perf 941.4459374037215
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 300 1000 319.603 86.4225904655 319.6026779233 307.82436 0.20000 5000 0.491 835.650 181.359
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 300 1000 649.431 123.0760117041 649.4310041797 307.82436 0.20000 5000 0.491 835.650 181.359
episode 400 total steps 400000 last perf 1998.9159226163288
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 400 1000 1541.706 137.0518755685 1541.7061888397 2129.72309 0.20000 5000 0.537 1031.988 213.218
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 400 1000 2010.571 136.1758744628 2010.5709348364 2129.72309 0.20000 5000 0.537 1031.988 213.218
episode 500 total steps 500000 last perf 678.2993231530708
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 500 1000 191.508 99.8058855712 191.5079652816 5862.61120 0.20000 5000 0.568 1227.944 245.120
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 500 1000 2799.371 232.6936888905 2799.3706928199 5862.61120 0.20000 5000 0.568 1227.944 245.120
episode 600 total steps 600000 last perf 3279.0396426752723
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : L 600 1000 2524.436 202.3936700988 2524.4356832294 9601.44991 0.20000 5000 0.552 1287.929 266.153
#INFO :/home/jim/ddrl/agent/cacla/src/pybinding/nfac.cpp.85 : T 600 1000 3428.879 278.4329341160 3428.8785063861 9601.44991 0.20000 5000 0.552 1287.929 266.153
With plain DDPG on FetchReach, the reward is also always -50, and in the MuJoCo viewer the arm never reaches the desired_goal either.
command:
python -m baselines.run --alg=ddpg --env=FetchReach-v1 --num_timesteps=5000 --play
(clustering) jim@jim-Inspiron-7577:~/baselines $ python -m baselines.run --alg=ddpg --env=FetchReach-v1 --num_timesteps=5000 --play
/home/jim/anaconda2/envs/clustering/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.24.2) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Logging to /tmp/openai-2019-06-24-09-15-06-825990
env_type: robotics
2019-06-24 09:15:14.135417: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-06-24 09:15:14.388060: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-24 09:15:14.388312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1050 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 3.95GiB freeMemory: 3.44GiB
2019-06-24 09:15:14.388329: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
Training ddpg on robotics:FetchReach-v1 with arguments
{'network': 'mlp'}
scaling actions by [1. 1. 1. 1.] before executing in env
setting up param noise
param_noise_actor/mlp_fc0/w:0 <- actor/mlp_fc0/w:0 + noise
param_noise_actor/mlp_fc0/b:0 <- actor/mlp_fc0/b:0 + noise
param_noise_actor/mlp_fc1/w:0 <- actor/mlp_fc1/w:0 + noise
param_noise_actor/mlp_fc1/b:0 <- actor/mlp_fc1/b:0 + noise
param_noise_actor/dense/kernel:0 <- actor/dense/kernel:0 + noise
param_noise_actor/dense/bias:0 <- actor/dense/bias:0 + noise
adaptive_param_noise_actor/mlp_fc0/w:0 <- actor/mlp_fc0/w:0 + noise
adaptive_param_noise_actor/mlp_fc0/b:0 <- actor/mlp_fc0/b:0 + noise
adaptive_param_noise_actor/mlp_fc1/w:0 <- actor/mlp_fc1/w:0 + noise
adaptive_param_noise_actor/mlp_fc1/b:0 <- actor/mlp_fc1/b:0 + noise
adaptive_param_noise_actor/dense/kernel:0 <- actor/dense/kernel:0 + noise
adaptive_param_noise_actor/dense/bias:0 <- actor/dense/bias:0 + noise
setting up actor optimizer
actor shapes: [[16, 64], [64], [64, 64], [64], [64, 4], [4]]
actor params: 5508
setting up critic optimizer
regularizing: critic/mlp_fc0/w:0
regularizing: critic/mlp_fc1/w:0
applying l2 regularization with 0.01
critic shapes: [[20, 64], [64], [64, 64], [64], [64, 1], [1]]
critic params: 5569
setting up target updates ...
target_actor/mlp_fc0/w:0 <- actor/mlp_fc0/w:0
target_actor/mlp_fc0/b:0 <- actor/mlp_fc0/b:0
target_actor/mlp_fc1/w:0 <- actor/mlp_fc1/w:0
target_actor/mlp_fc1/b:0 <- actor/mlp_fc1/b:0
target_actor/dense/kernel:0 <- actor/dense/kernel:0
target_actor/dense/bias:0 <- actor/dense/bias:0
setting up target updates ...
target_critic/mlp_fc0/w:0 <- critic/mlp_fc0/w:0
target_critic/mlp_fc0/b:0 <- critic/mlp_fc0/b:0
target_critic/mlp_fc1/w:0 <- critic/mlp_fc1/w:0
target_critic/mlp_fc1/b:0 <- critic/mlp_fc1/b:0
target_critic/output/kernel:0 <- critic/output/kernel:0
target_critic/output/bias:0 <- critic/output/bias:0
Using agent with the following configuration:
dict_items([('clip_norm', None), ('target_init_updates', [<tf.Operation 'group_deps_4' type=NoOp>, <tf.Operation 'group_deps_6' type=NoOp>]), ('critic_with_actor_tf', <tf.Tensor 'clip_by_value_3:0' shape=(?, 1) dtype=float32>), ('perturb_adaptive_policy_ops', <tf.Operation 'group_deps_1' type=NoOp>), ('return_range', (-inf, inf)), ('obs1', <tf.Tensor 'obs1:0' shape=(?, 16) dtype=float32>), ('perturbed_actor_tf', <tf.Tensor 'param_noise_actor/Tanh_2:0' shape=(?, 4) dtype=float32>), ('actor_tf', <tf.Tensor 'actor/Tanh_2:0' shape=(?, 4) dtype=float32>), ('memory', <baselines.ddpg.memory.Memory object at 0x7f33d5a49b00>), ('actor_optimizer', <baselines.common.mpi_adam.MpiAdam object at 0x7f33c0ad6e80>), ('normalize_observations', True), ('critic_optimizer', <baselines.common.mpi_adam.MpiAdam object at 0x7f341f6cdb70>), ('terminals1', <tf.Tensor 'terminals1:0' shape=(?, 1) dtype=float32>), ('batch_size', 64), ('actor_grads', <tf.Tensor 'concat:0' shape=(5508,) dtype=float32>), ('actor_loss', <tf.Tensor 'Neg:0' shape=() dtype=float32>), ('initial_state', None), ('stats_ops', [<tf.Tensor 'Mean_3:0' shape=() dtype=float32>, <tf.Tensor 'Mean_4:0' shape=() dtype=float32>, <tf.Tensor 'Mean_5:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_1:0' shape=() dtype=float32>, <tf.Tensor 'Mean_8:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_2:0' shape=() dtype=float32>, <tf.Tensor 'Mean_11:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_3:0' shape=() dtype=float32>, <tf.Tensor 'Mean_14:0' shape=() dtype=float32>, <tf.Tensor 'Sqrt_4:0' shape=() dtype=float32>]), ('actor', <baselines.ddpg.models.Actor object at 0x7f33c2709358>), ('stats_sample', None), ('target_Q', <tf.Tensor 'add_2:0' shape=(?, 1) dtype=float32>), ('critic', <baselines.ddpg.models.Critic object at 0x7f33c2709320>), ('param_noise_stddev', <tf.Tensor 'param_noise_stddev:0' shape=() dtype=float32>), ('action_noise', None), ('observation_range', (-5.0, 5.0)), ('target_soft_updates', [<tf.Operation 'group_deps_5' type=NoOp>, <tf.Operation 'group_deps_7' type=NoOp>]), ('critic_loss', <tf.Tensor 'add_15:0' shape=() dtype=float32>), ('target_critic', <baselines.ddpg.models.Critic object at 0x7f33c2709470>), ('stats_names', ['obs_rms_mean', 'obs_rms_std', 'reference_Q_mean', 'reference_Q_std', 'reference_actor_Q_mean', 'reference_actor_Q_std', 'reference_action_mean', 'reference_action_std', 'reference_perturbed_action_mean', 'reference_perturbed_action_std']), ('ret_rms', None), ('critic_tf', <tf.Tensor 'clip_by_value_2:0' shape=(?, 1) dtype=float32>), ('normalized_critic_with_actor_tf', <tf.Tensor 'critic_1/output/BiasAdd:0' shape=(?, 1) dtype=float32>), ('gamma', 0.99), ('action_range', (-1.0, 1.0)), ('adaptive_policy_distance', <tf.Tensor 'Sqrt:0' shape=() dtype=float32>), ('normalize_returns', False), ('reward_scale', 1.0), ('critic_target', <tf.Tensor 'critic_target:0' shape=(?, 1) dtype=float32>), ('param_noise', AdaptiveParamNoiseSpec(initial_stddev=0.2, desired_action_stddev=0.2, adoption_coefficient=1.01)), ('enable_popart', False), ('actions', <tf.Tensor 'actions:0' shape=(?, 4) dtype=float32>), ('critic_grads', <tf.Tensor 'concat_2:0' shape=(5569,) dtype=float32>), ('perturb_policy_ops', <tf.Operation 'group_deps' type=NoOp>), ('normalized_critic_tf', <tf.Tensor 'critic/output/BiasAdd:0' shape=(?, 1) dtype=float32>), ('obs_rms', <baselines.common.mpi_running_mean_std.RunningMeanStd object at 0x7f33c2709eb8>), ('actor_lr', 0.0001), ('critic_lr', 0.001), ('obs0', <tf.Tensor 'obs0:0' shape=(?, 16) dtype=float32>), ('critic_l2_reg', 0.01), ('rewards', 
<tf.Tensor 'rewards:0' shape=(?, 1) dtype=float32>), ('target_actor', <baselines.ddpg.models.Actor object at 0x7f33c246e940>), ('tau', 0.01)])
---------------------------------------------
| obs_rms_mean | 0.49 |
| obs_rms_std | 0.156 |
| param_noise_stddev | 0.164 |
| reference_action_mean | 0.029 |
| reference_action_std | 0.773 |
| reference_actor_Q_mean | -7.02 |
| reference_actor_Q_std | 0.745 |
| reference_perturbed_action_... | 0.033 |
| reference_perturbed_action_std | 0.781 |
| reference_Q_mean | -7.11 |
| reference_Q_std | 0.674 |
| rollout/actions_mean | 0.0536 |
| rollout/actions_std | 0.659 |
| rollout/episode_steps | 50 |
| rollout/episodes | 40 |
| rollout/Q_mean | -2.97 |
| rollout/return | -49.8 |
| rollout/return_history | -49.8 |
| rollout/return_history_std | 1.25 |
| rollout/return_std | 1.25 |
| total/duration | 12.2 |
| total/episodes | 40 |
| total/epochs | 1 |
| total/steps | 2e+03 |
| total/steps_per_second | 164 |
| train/loss_actor | 6.83 |
| train/loss_critic | 0.808 |
| train/param_noise_distance | 0.596 |
---------------------------------------------
---------------------------------------------
| obs_rms_mean | 0.502 |
| obs_rms_std | 0.146 |
| param_noise_stddev | 0.134 |
| reference_action_mean | 0.107 |
| reference_action_std | 0.784 |
| reference_actor_Q_mean | -11.5 |
| reference_actor_Q_std | 3.23 |
| reference_perturbed_action_... | 0.319 |
| reference_perturbed_action_std | 0.651 |
| reference_Q_mean | -11.8 |
| reference_Q_std | 2.89 |
| rollout/actions_mean | 0.0836 |
| rollout/actions_std | 0.686 |
| rollout/episode_steps | 50 |
| rollout/episodes | 80 |
| rollout/Q_mean | -6.57 |
| rollout/return | -49.8 |
| rollout/return_history | -49.8 |
| rollout/return_history_std | 0.972 |
| rollout/return_std | 0.972 |
| total/duration | 22.9 |
| total/episodes | 80 |
| total/epochs | 2 |
| total/steps | 4e+03 |
| total/steps_per_second | 175 |
| train/loss_actor | 12 |
| train/loss_critic | 1.98 |
| train/param_noise_distance | 0.326 |
---------------------------------------------
Running trained model
Creating window glfw
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
episode_rew=-50.0
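For context on the constant -50.0 returns: the Fetch robotics tasks use a sparse reward of -1 on every step where the goal has not been reached and 0 once it has, so a 50-step episode that never reaches the goal sums to exactly -50. A minimal sanity check, assuming gym with the robotics/MuJoCo environments installed:
import gym

# FetchReach-v1 uses the sparse -1/0 reward described above; a random policy
# rarely gets the gripper within the goal threshold, so the episode return
# typically comes out as -50.0, matching the episode_rew lines above.
env = gym.make('FetchReach-v1')
obs = env.reset()
total, done = 0.0, False
while not done:
    obs, reward, done, info = env.step(env.action_space.sample())
    total += reward
print('random-policy episode return:', total)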
It's expected that vanilla PeNFAC can't easily solve this task because of the sparse rewards (as well as vanilla ddpg, vanilla PPO, etc.).
I developed a "data augmentation" module for PeNFAC similar to HER (even if we can't talk of off-policy replay here). See 426b203d8537de9c254d86c1939d8dc112ca6c10.
If you want to use it you need to:
1) change your config.ini to use "libddrl-hpenfac.so" instead of "libddrl-penfac.so";
2) add the command-line argument "--goal-based" when you call python run.py;
3) add the hyperparameter "hindsight_nb_destination=5" to the [agent] section in config.ini.
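To make the hindsight idea concrete, here is a purely illustrative Python sketch of HER-style goal relabeling with the "future" strategy. It is not the ddrl C++ implementation; the names relabel_episode, sparse_reward and nb_destination are invented for the example (nb_destination only mirrors the hindsight_nb_destination hyperparameter):
import numpy as np

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    # Same shape of reward as the Fetch tasks: 0 when close enough, -1 otherwise.
    return 0.0 if np.linalg.norm(achieved_goal - desired_goal) < threshold else -1.0

def relabel_episode(episode, nb_destination=5, rng=np.random):
    # episode: list of dicts with keys obs, action, achieved_goal, desired_goal, reward.
    augmented = []
    for t, tr in enumerate(episode):
        augmented.append(tr)  # keep the original transition
        # Pick up to nb_destination achieved goals from later steps of the same
        # episode (HER's "future" strategy) and use them as substitute goals.
        future_ids = rng.choice(np.arange(t, len(episode)),
                                size=min(nb_destination, len(episode) - t),
                                replace=False)
        for i in future_ids:
            new_tr = dict(tr)
            new_tr['desired_goal'] = episode[i]['achieved_goal']
            new_tr['reward'] = sparse_reward(tr['achieved_goal'], new_tr['desired_goal'])
            augmented.append(new_tr)
    return augmented

# Tiny fake 50-step episode, just to show the shapes involved.
episode = [{'obs': np.zeros(10), 'action': np.zeros(4),
            'achieved_goal': np.random.randn(3), 'desired_goal': np.ones(3),
            'reward': -1.0} for _ in range(50)]
print(len(relabel_episode(episode)))  # roughly 50 * (1 + nb_destination) transitions
In an on-policy setting like PeNFAC these relabeled transitions cannot simply be replayed from a buffer, which is presumably why this is called "data augmentation" rather than off-policy replay above.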
Here are preliminary results on the environment you tried:
config.ini:
...
[agent]
gamma=0.98
decision_each=1
#policy
noise=0.2
gaussian_policy=1
hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1
#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3
stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true
hindsight_nb_destination=5
OK, what do you suggest we do next? Should we compare using the success_rate or the last reward?
@matthieu637
I modified the gym/run.py file to output the success_rate like this:
while sample_steps_counter < total_max_steps + testing_each * max_steps:
    # ... (rest of the training loop)
    if episode % display_log_each == 0:
        # (results[-1] + max_steps) / max_steps maps a return in [-max_steps, 0]
        # to [0, 1], i.e. the fraction of steps of the last test episode on which
        # the goal was reached.
        success_rate = (results[-1] + max_steps) / max_steps if len(results) > 0 else 0
        n_epoch = episode // display_log_each
        print('n_epoch', n_epoch, 'success rate', success_rate)
        writer.add_scalar(env_name + 'success_rate_hpenfac', success_rate, n_epoch + 1)
        print('episode', episode, 'total steps', sample_steps_counter, 'last perf', results[-1] if len(results) > 0 else 0)
And here is the comparison of the success_rate between DDPG+HER and hpenfac with the original hyperparameters:
:~$ python -m baselines.run --alg=her --env=FetchPush-v1 --num_timesteps=2.5e6
:~$ python run.py --goal-based
Here I used 2.5e6 total_max_steps and a config.ini like this:
[simulation]
total_max_steps=2500000
testing_each=10
#number of trajectories for testing
testing_trials=10
dump_log_each=50
display_log_each=100
save_agent_each=100000
library=/home/jim/ddrl/agent/cacla/lib/libddrl-hpenfac.so
; env_name=RoboschoolHalfCheetah-v1
; env_name=HalfCheetah-v2
env_name=FetchPush-v1
[agent]
gamma=0.98
decision_each=1
#policy
noise=0.2
gaussian_policy=1
hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1
#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0
vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3
stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true
hindsight_nb_destination=5
I plan to use the hyperparameters from DDPG+HER as given here in Sec. 2.2; can you share some tips on how to modify your code to use the same hyperparameters?
On FetchReach:
[simulation]
total_max_steps=2500000
testing_each=10
#number of trajectories for testing
testing_trials=10
dump_log_each=50
display_log_each=100
save_agent_each=100000
library=/home/jim/ddrl/agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchReach-v1
[agent]
gamma=0.98
decision_each=1
#policy
noise=0.2
gaussian_policy=1
hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1
#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0
vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3
stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true
hindsight_nb_destination=5
:~$ python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=2.5e6
For PeNFAC, you're computing a success rate equivalent to "how many times I reached the goal within one episode", whereas in HER it only checks "if the goal was reached at the end of the episode". The two curves are not comparable: PeNFAC is penalized since the intermediate steps count as failures.
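To illustrate the mismatch with a small hypothetical example (plain Python, not from either codebase): with the -1/0 sparse reward and max_steps = 50, the formula (results[-1] + max_steps) / max_steps gives the fraction of steps spent on the goal, whereas baselines' HER logs test/success_rate from the final step of each test episode only:
import numpy as np

# Hypothetical per-step success flags for one 50-step test episode:
# the gripper passes over the goal for a few steps but is not there at the end.
is_success = np.zeros(50)
is_success[20:23] = 1.0

episode_return = np.sum(is_success - 1.0)    # -1 for each unsuccessful step: -47.0
per_step_rate = (episode_return + 50) / 50   # PeNFAC-style rate: 0.06
final_step_rate = float(is_success[-1])      # HER-style rate: 0.0

print(per_step_rate, final_step_rate)
The same episode thus counts as 6% success under the first definition and 0% under the second, which is why the two curves should not be overlaid directly.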
@matthieu637 Did you mean I can use the last perf from every test_episode to calculate the success_rate?
@matthieu637
I use this Python script to plot the success_rate from 0.1.monitor.csv:
import numpy as np
from numpy import genfromtxt
import matplotlib.pyplot as plt
import matplotlib

matplotlib.rcParams.update({'font.size': 12})
plt.rcParams["font.family"] = "Times New Roman"

episodes = 800
epochs = 200
env = 'FetchPush-v1'

# Column 3 of 0.1.monitor.csv is read as the per-episode is_success flag;
# the first row (header) is skipped.
data = genfromtxt('0.1.monitor.csv', delimiter=',')
is_success = data[:, 3][1:len(data)]
# Group the flags into epochs x episodes so each row can be averaged.
to_epoch = is_success.reshape(epochs, episodes)

x, y = [], []
for epoch, last_perfs in enumerate(to_epoch):
    success_rate = np.sum(last_perfs) / episodes
    x = x + [epoch]
    y = y + [success_rate]

plt.figure(figsize=(15, 10))
plt.plot(x, y, marker='o', linestyle='-', markersize=2, linewidth=1, label='hpenfac')
plt.xlabel('n_epoch')
plt.ylabel('success rate')
plt.title(env)
plt.legend(loc=2)
plt.savefig(env + '.png')
plt.show()
[simulation]
total_max_steps=8000000
testing_each=10
#number of trajectories for testing
testing_trials=10
dump_log_each=50
display_log_each=100
save_agent_each=100000
library=/home/jim/ddrl/agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchPush-v1
[agent]
gamma=0.98
decision_each=1
#policy
noise=0.2
gaussian_policy=1
hidden_unit_v=64:64
hidden_unit_a=64:64
momentum=0
actor_output_layer_type=2
hidden_layer_type=1
#learning
alpha_a=0.0001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0
vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3
stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true
hindsight_nb_destination=5
[simulation]
total_max_steps=8000000
testing_each=10
#number of trajectories for testing
testing_trials=10
dump_log_each=50
display_log_each=100
save_agent_each=100000
library=../agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchPush-v1
[agent]
gamma=0.98
decision_each=1
#policy
noise=0.2
gaussian_policy=1
hidden_unit_v=256:256:256
hidden_unit_a=256:256:256
momentum=0
actor_output_layer_type=2
hidden_layer_type=3
#learning
alpha_a=0.001
alpha_v=0.001
batch_norm_actor=7
batch_norm_critic=0
reward_scale=1.0
vnn_from_scratch=false
update_critic_first=true
number_fitted_iteration=10
stoch_iter_critic=1
lambda=0.9
gae=true
update_each_episode=3
stoch_iter_actor=1
beta_target=0.03
ignore_poss_ac=false
conserve_beta=true
disable_cac=false
disable_trust_region=true
hindsight_nb_destination=5
That's strange; using your Python script, here's what I've got (64x64 units):
Have you used the latest version and rebuilt the ddrl libraries? The only difference in our config.ini is that, in my case, I've got:
...
testing_each=1
testing_trials=1
...
Are you sure it is training on FetchPush and not FetchReach, i.e.
env_name=FetchPush-v1
in config.ini?
If FetchPush really reaches this performance, it is hard to believe; here is the paper's performance:
My bad, I'm talking about FetchReach-v1. For Reach, I guess you have to start optimizing the hyperparameters.
Thx @matthieu637, I can use DDPG+HER to check the performance on FetchReach later. But they did not compare on FetchReach in the paper (https://arxiv.org/abs/1707.01495); I think it is hard to compare because its performance varies a lot and reaches a 100% success rate very quickly.
I don't understand what you mean in the second sentence. I am manually changing the hyperparameters for FetchPush while studying how to use lhpo, but I worry I cannot finish the hyperparameter optimization before the deadline.
@matthieu637 For the FetchReach-v1 task, DDPG+HER vastly outperforms: it only needs 4 epochs x 10 episodes x 50 timesteps = 2000 total timesteps to reach a 100% test success rate.
jim@jim-Inspiron-7577:~/baselines $ python -m baselines.run --alg=her --env=FetchReach-v1 --num_timesteps=8e5 --n_cycles=10
/home/jim/anaconda2/envs/clustering/lib/python3.5/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.24.2) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Logging to /tmp/openai-2019-06-28-20-56-42-334498
env_type: robotics
2019-06-28 20:56:43.049827: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-06-28 20:56:43.051216: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2019-06-28 20:56:43.051249: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: jim-Inspiron-7577
2019-06-28 20:56:43.051259: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: jim-Inspiron-7577
2019-06-28 20:56:43.051294: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 410.78.0
2019-06-28 20:56:43.051319: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module 410.78 Sat Nov 10 22:09:04 CST 2018
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
"""
2019-06-28 20:56:43.051335: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 410.78.0
2019-06-28 20:56:43.051343: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:300] kernel version seems to match DSO: 410.78.0
Training her on robotics:FetchReach-v1 with arguments
{'network': 'mlp', 'n_cycles': 10}
T: 50
_Q_lr: 0.001
_action_l2: 1.0
_batch_size: 256
_buffer_size: 1000000
_clip_obs: 200.0
_hidden: 256
_layers: 3
_max_u: 1.0
_network_class: baselines.her.actor_critic:ActorCritic
_norm_clip: 5
_norm_eps: 0.01
_pi_lr: 0.001
_polyak: 0.95
_relative_goals: False
_scope: ddpg
aux_loss_weight: 0.0078
bc_loss: 0
ddpg_params: {'batch_size': 256, 'max_u': 1.0, 'action_l2': 1.0, 'network_class': 'baselines.her.actor_critic:ActorCritic', 'norm_clip': 5, 'polyak': 0.95, 'buffer_size': 1000000, 'layers': 3, 'clip_obs': 200.0, 'scope': 'ddpg', 'norm_eps': 0.01, 'hidden': 256, 'relative_goals': False, 'pi_lr': 0.001, 'Q_lr': 0.001}
demo_batch_size: 128
env_name: FetchReach-v1
gamma: 0.98
make_env: <function prepare_params.<locals>.make_env at 0x7f12f5fac510>
n_batches: 40
n_cycles: 10
n_test_rollouts: 10
noise_eps: 0.2
num_demo: 100
prm_loss_weight: 0.001
q_filter: 0
random_eps: 0.3
replay_k: 4
replay_strategy: future
rollout_batch_size: 1
test_with_polyak: False
*** Warning ***
You are running HER with just a single MPI worker. This will work, but the experiments that we report in Plappert et al. (2018, https://arxiv.org/abs/1802.09464) were obtained with --num_cpu 19. This makes a significant difference and if you are looking to reproduce those results, be aware of this. Please also refer to https://github.com/openai/baselines/issues/314 for further details.
****************
Creating a DDPG agent with action space 4 x 1.0...
Training...
---------------------------------
| epoch | 0 |
| stats_g/mean | 0.914 |
| stats_g/std | 0.107 |
| stats_o/mean | 0.271 |
| stats_o/std | 0.0339 |
| test/episode | 10 |
| test/mean_Q | -0.356 |
| test/success_rate | 0.6 |
| train/episode | 10 |
| train/success_rate | 0 |
---------------------------------
---------------------------------
| epoch | 1 |
| stats_g/mean | 0.885 |
| stats_g/std | 0.112 |
| stats_o/mean | 0.264 |
| stats_o/std | 0.0351 |
| test/episode | 20 |
| test/mean_Q | -1.02 |
| test/success_rate | 0.7 |
| train/episode | 20 |
| train/success_rate | 0.7 |
---------------------------------
---------------------------------
| epoch | 2 |
| stats_g/mean | 0.881 |
| stats_g/std | 0.11 |
| stats_o/mean | 0.263 |
| stats_o/std | 0.035 |
| test/episode | 30 |
| test/mean_Q | -0.579 |
| test/success_rate | 1 |
| train/episode | 30 |
| train/success_rate | 0.8 |
---------------------------------
---------------------------------
| epoch | 3 |
| stats_g/mean | 0.874 |
| stats_g/std | 0.107 |
| stats_o/mean | 0.261 |
| stats_o/std | 0.0343 |
| test/episode | 40 |
| test/mean_Q | -0.553 |
| test/success_rate | 1 |
| train/episode | 40 |
| train/success_rate | 0.8 |
---------------------------------
---------------------------------
| epoch | 4 |
| stats_g/mean | 0.874 |
| stats_g/std | 0.103 |
| stats_o/mean | 0.261 |
| stats_o/std | 0.0335 |
| test/episode | 50 |
| test/mean_Q | -0.526 |
| test/success_rate | 1 |
| train/episode | 50 |
| train/success_rate | 1 |
---------------------------------
Fixed in last commit.
Produced without hyperparameter optimization, with these params:
[simulation]
total_max_steps = 2000000
testing_each = 100
testing_trials = 40
dump_log_each = 1
display_log_each = 200
save_agent_each = 10000000
library = ..../ddrl/agent/cacla/lib/libddrl-hpenfac.so
env_name=FetchPush-v1
[agent]
gamma = 0.98
noise = 0.35
gaussian_policy = 1
hidden_unit_v = 256:256:256
hidden_unit_a = 256:256:256
actor_output_layer_type = 2
hidden_layer_type = 3
alpha_a = 0.0005
alpha_v = 0.001
number_fitted_iteration = 10
stoch_iter_critic = 1
lambda = 0.6
gae = true
update_each_episode = 40
stoch_iter_actor = 10
beta_target = 0.03
ignore_poss_ac = false
conserve_beta = true
disable_cac = false
disable_trust_region = true
hindsight_nb_destination = 3
@matthieu637 Hi Mat, do you remember which trick you used to make Hindsight Augmentation work with PeNFAC?
It seems that the last performance is hard to increase when using FetchPush-v1; see the bottom of the output.