hill-a / stable-baselines

A fork of OpenAI Baselines, implementations of reinforcement learning algorithms
http://stable-baselines.readthedocs.io/
MIT License

FetchPickAndPlace not training using DDPG+HER #424

Closed: fisherxue closed this issue 3 years ago

fisherxue commented 5 years ago

I am trying to train FetchPickAndPlace with DDPG+HER as per https://arxiv.org/pdf/1802.09464.pdf; however, regardless of how long I train, the agent fails to learn anything. I saw that #198 mentioned that OpenAI used a number of tricks to get it to work. Has anyone had any luck doing so in stable-baselines? Thanks!

FetchReach and FetchPush both train fine.

My current hyperparameters:

FetchPickAndPlace-v1:
  n_timesteps: !!float 5e6
  policy: 'MlpPolicy'
  model_class: 'ddpg'
  n_sampled_goal: 4
  goal_selection_strategy: 'future'
  buffer_size: 1000000
  batch_size: 256
  gamma: 1.0
  critic_l2_reg: 1.0
  observation_range: [-200.0, 200.0]
  random_exploration: 0.3
  actor_lr: !!float 1e-3
  critic_lr: !!float 1e-3
  noise_type: 'normal'
  noise_std: 0.2
  normalize_observations: true
  normalize_returns: false
  policy_kwargs: "dict(layers=[256, 256, 256])"
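
For reference, a minimal sketch of how this zoo config would map onto a direct stable-baselines call (the parameter mapping is assumed from the SB2 HER/DDPG docs, not verified):

import gym
import numpy as np
from stable_baselines import HER, DDPG
from stable_baselines.ddpg.noise import NormalActionNoise

env = gym.make('FetchPickAndPlace-v1')
n_actions = env.action_space.shape[0]
# 'normal' noise with std 0.2 on each action dimension
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.2 * np.ones(n_actions))

model = HER('MlpPolicy', env, DDPG,
            n_sampled_goal=4, goal_selection_strategy='future',
            buffer_size=1000000, batch_size=256, gamma=1.0,
            critic_l2_reg=1.0, observation_range=(-200.0, 200.0),
            random_exploration=0.3, actor_lr=1e-3, critic_lr=1e-3,
            action_noise=action_noise,
            normalize_observations=True, normalize_returns=False,
            policy_kwargs=dict(layers=[256, 256, 256]))
model.learn(total_timesteps=int(5e6))
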
araffin commented 5 years ago

Hello, the main trick is the number of workers. How many did you use?

NOTE: from the documentation: "If you want to reproduce results from the paper, please use the rl baselines zoo in order to have the correct hyperparameters and at least 8 MPI workers with DDPG."

fisherxue commented 5 years ago

Using 8 workers on an 8-core machine, but it's still not learning (success rate 0.02).

I'm running rl-baselines-zoo with: mpirun -np 8 python train.py --algo her --env FetchPickAndPlace-v1

araffin commented 5 years ago

OK, so it seems that is not enough for the harder envs; OpenAI used 19 workers to produce the results in their paper...

araffin commented 5 years ago

From a previous version of the doc (before refactoring): https://github.com/hill-a/stable-baselines/blob/026e0528d968fafdbe2f017676c99cc3d0fbd10d/docs/modules/her.rst

" In order to reproduce the results from Plappert et al. (2018) [..] This will require a machine with sufficient amount of physical CPU cores. In our experiments, we used Azure's D15v2 instances, which have 20 physical cores. We only scheduled the experiment on 19 of those to leave some head-room on the system. "

fisherxue commented 5 years ago

Thanks, I'll try it with more workers!

However, I also tried it using baselines with this command:

python -m baselines.run --num_env 2 --alg=her --env=FetchPickAndPlace-v1 --num_timesteps=5.0e6

This gave me a success rate of 1:

---------------------------------
| epoch              | 677      |
| stats_g/mean       | 0.851    |
| stats_g/std        | 0.107    |
| stats_o/mean       | 0.205    |
| stats_o/std        | 0.115    |
| test/episode       | 1.36e+04 |
| test/mean_Q        | -1.32    |
| test/success_rate  | 1        |
| train/episode      | 6.78e+04 |
| train/success_rate | 0.59     |
---------------------------------

Meanwhile, with:

mpirun -np 8 python train.py --algo her --env FetchPickAndPlace-v1

I get a 1% success rate.

I'm wondering if the issue may be with different hyperparameter options.

They have:

Actor and critic networks: 3 layers with 256 units each and ReLU non-linearities
Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1·10^-3 for training both actor and critic
Buffer size: 10^6 transitions
Polyak-averaging coefficient: 0.95 ***
Action L2 norm coefficient: 1.0
Observation clipping: [−200, 200]
Batch size: 256
Rollouts per MPI worker: 2 ***
Number of MPI workers: 19
Cycles per epoch: 50
Batches per cycle: 40
Test rollouts per epoch: 10
Probability of random actions: 0.3
Scale of additive Gaussian noise: 0.2
Probability of HER experience replay: 0.8
Normalized clipping: [−5, 5]

*** marks the ones I'm not sure how to implement in stable-baselines; a rough mapping for the Polyak coefficient is sketched below.
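
If the SB2 DDPG soft update is target <- (1 - tau) * target + tau * source, then OpenAI's Polyak coefficient of 0.95 should map to tau = 0.05; a sketch of passing it through the HER wrapper (my assumption, untested):

import gym
from stable_baselines import HER, DDPG

env = gym.make('FetchPickAndPlace-v1')
# tau = 1 - polyak: OpenAI's polyak = 0.95 would become tau = 0.05 here (assumed mapping)
# 'rollouts per MPI worker' has no obvious one-to-one equivalent; SB2's
# nb_rollout_steps counts environment steps per training iteration instead
model = HER('MlpPolicy', env, DDPG,
            n_sampled_goal=4, goal_selection_strategy='future',
            tau=0.05)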

araffin commented 5 years ago

Yes, they used a custom DDPG version and it's hard to make the correspondence (cf. the issue you mentioned at the beginning). However, if you find a bug or the trick that makes it work with fewer workers, we would be happy to add that.

Edit: the success rate displayed for SB is the train one; the test one should be higher

fisherxue commented 5 years ago

Update: trained with 20 workers, but still no luck (-50 reward, fails >99% of the time during test). Any tips?

araffin commented 5 years ago

I suspect either a subtle bug in the implementation (subtle, because the implementation still works on other envs) or one of the additional tricks/hyperparameters of OpenAI that make it work. One that was not implemented is the L2 penalty on the action. They also use a different formulation of HER, creating new transitions only when sampling.
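
For reference, the action L2 penalty adds a single term to the DDPG actor loss; a minimal TF1-style sketch of that term (function and argument names are illustrative, not taken from either codebase):

import tensorflow as tf

def actor_loss_with_action_penalty(q_with_actor, actor_actions, max_action, action_l2=1.0):
    # standard DDPG actor loss: maximize the critic's value of the actor's actions
    loss = -tf.reduce_mean(q_with_actor)
    # extra term: penalize large (pre-scaling) actions; the paper uses a coefficient of 1.0
    loss += action_l2 * tf.reduce_mean(tf.square(actor_actions / max_action))
    return loss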

araffin commented 5 years ago

If you have some time, the best way would be to remove each trick from the OpenAI baselines repo until it breaks (unfortunately, I don't have the resources to do that on my own...)

fisherxue commented 5 years ago

Will do, I'll get back to you in a few weeks (hardware I would run it on is currently occupied) :)

RyanRizzo96 commented 4 years ago

@araffin quick question: why is the test success rate higher than the train one? Doesn't this mean there is sampling bias in your test?

You wrote: "the success rate displayed for SB is the train one; the test one should be higher"

Thank you in advance!

araffin commented 4 years ago

During testing, all the exploration noise is removed and we use a deterministic policy, hence the difference.
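
A sketch of what evaluation looks like without the noise (assuming an already trained SB2 model and its env):

obs = env.reset()
done = False
while not done:
    # deterministic=True: no exploration noise is added at test time
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)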

151levym commented 4 years ago

I'm curious if anyone ever found the solution to this. I'm working on a very similar environment to Fetch and getting about 80% success with the regular baselines library, but 0% with stable-baselines. I was wondering if it has to do with VecNormalize not being implemented in HER, or something entirely different.

araffin commented 4 years ago

An update here: it seems that HER + SAC is working (with only one worker) on FetchPickAndPlace (and other envs). You can find trained agents in the RL Zoo; see PR https://github.com/araffin/rl-baselines-zoo/pull/53
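
A rough sketch of that HER + SAC setup (hyperparameters here are placeholders; the tuned values are in the zoo PR):

import gym
from stable_baselines import HER, SAC

env = gym.make('FetchPickAndPlace-v1')
model = HER('MlpPolicy', env, SAC,
            n_sampled_goal=4, goal_selection_strategy='future',
            buffer_size=1000000, batch_size=256, verbose=1)
model.learn(total_timesteps=int(3e6))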

araffin commented 3 years ago

Fixed in SB3; results and hyperparameters are available in the zoo: https://github.com/DLR-RM/rl-baselines3-zoo
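
A sketch of the SB3 equivalent, where HER is implemented as a replay buffer (hyperparameters are placeholders; the tuned ones are in rl-baselines3-zoo):

import gym
from stable_baselines3 import SAC, HerReplayBuffer

env = gym.make('FetchPickAndPlace-v1')
model = SAC('MultiInputPolicy', env,
            replay_buffer_class=HerReplayBuffer,
            replay_buffer_kwargs=dict(n_sampled_goal=4,
                                      goal_selection_strategy='future'),
            buffer_size=1000000, verbose=1)
model.learn(total_timesteps=int(1e6))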