qgallouedec / panda-gym

Set of robotic environments based on PyBullet physics engine and gymnasium.
MIT License

PandaPush-v2 does not learn with SB3 #21

Closed shukla-yash closed 2 years ago

shukla-yash commented 2 years ago

Hi,

I am trying to recreate your results from the paper 'panda-gym: Open-source goal-conditioned environments for robotic learning', and the code given in train_push.py does not seem to work with the default parameters. Can you point me to the RL code you used to get those results? Also, are the learning curves in the paper from the sparse or the dense reward setting? Thanks!

qgallouedec commented 2 years ago

Hi,

What do you mean by "does not work"? Is an error raised during code execution? Or do you mean that the results you obtain do not match the curves in the paper?

The code I used for the paper is provided in this openai/baselines fork. It will allow you to strictly reproduce the results of the paper.

Nevertheless, since OpenAI has stopped maintaining that repo, I strongly advise you to use actively maintained RL code such as stable-baselines3, even if you will probably not be able to reproduce the results of the paper exactly.

If you still want to use openai/baselines to strictly reproduce my results, please note that I used the v0 version of panda-gym (not the v2 version I released in the meantime). I don't think the changes between these two versions affect the curves much, but I can't guarantee it.

shukla-yash commented 2 years ago

Thanks for your reply. By "does not work", I meant that the learning curves did not match (no error during execution). I trained for almost 3e6 timesteps, but the success rate for PandaPush-v2 was stuck at 0.15 (the learning curves in the paper converge to a success rate of ~1).

Thanks for your suggestions, I will try them in the meantime. Did you use sparse reward for the curves?

qgallouedec commented 2 years ago

Did you use sparse reward for the curves?

I did.

You can also check the baselines results on the rl-baselines3-zoo repo. For Push, convergence occurs well before 1e6 timesteps.

shukla-yash commented 2 years ago

Can you please post a snippet for PandaPickAndPlace-v2 that learns using DDPG from SB3, to reproduce the results in the paper? I realize it might not be exactly equivalent to the results from the paper, but anything that learns would work for me.

I've tried this, but it does not work:

import gym
import panda_gym  # registers the Panda environments
from stable_baselines3 import DDPG, HerReplayBuffer

env = gym.make("PandaPickAndPlace-v2")  # HerReplayBuffer expects a single, non-vectorized env

model = DDPG(policy="MultiInputPolicy", env=env, replay_buffer_class=HerReplayBuffer,
             verbose=1, batch_size=2048, buffer_size=1000000)

model.learn(total_timesteps=4000000)

Thanks!

qgallouedec commented 2 years ago

You can use rl-baselines3-zoo to train PandaPush-v2. You just need to paste these hyperparameters in hyperparams/ddpg.yml:

PandaPush-v2:
  env_wrapper: sb3_contrib.common.wrappers.TimeFeatureWrapper
  n_timesteps: !!float 1e6
  policy: 'MultiInputPolicy'
  buffer_size: 1000000
  batch_size: 2048
  gamma: 0.95
  learning_rate: !!float 1e-3
  noise_type: 'normal'
  noise_std: 0.1
  replay_buffer_class: HerReplayBuffer
  replay_buffer_kwargs: "dict(
    online_sampling=True,
    goal_selection_strategy='future',
    n_sampled_goal=4,
  )"
  policy_kwargs: "dict(net_arch=[512, 512, 512], n_critics=2)"

then run

python train.py --algo ddpg --env PandaPush-v2

Here is the result you will get:

[Screenshot: PandaPush-v2 training result, 2022-05-23]

It should also converge with PandaPickAndPlace-v2. Feel free to open a PR in the zoo like this one to share your results.
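
For reference, the zoo entry above maps roughly to the following plain-SB3 script. This is a minimal sketch, not the exact training script behind the curve above; it assumes panda-gym and sb3_contrib are installed, and an SB3 1.x-era release in which HerReplayBuffer still accepts the online_sampling argument.

import gym
import numpy as np
import panda_gym  # registers the Panda environments
from sb3_contrib.common.wrappers import TimeFeatureWrapper
from stable_baselines3 import DDPG, HerReplayBuffer
from stable_baselines3.common.noise import NormalActionNoise

# Same wrapper as the env_wrapper entry in the zoo config
env = TimeFeatureWrapper(gym.make("PandaPush-v2"))

# noise_type: 'normal' with noise_std: 0.1
n_actions = env.action_space.shape[0]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

model = DDPG(
    policy="MultiInputPolicy",
    env=env,
    buffer_size=1000000,
    batch_size=2048,
    gamma=0.95,
    learning_rate=1e-3,
    action_noise=action_noise,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(
        online_sampling=True,  # removed in later SB3 releases
        goal_selection_strategy="future",
        n_sampled_goal=4,
    ),
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    verbose=1,
)
model.learn(total_timesteps=1000000)  # n_timesteps: 1e6 in the zoo config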