qgallouedec / panda-gym

Set of robotic environments based on PyBullet physics engine and gymnasium.

PandaPickAndPlace-v3 Training and Hyperparameters #66

Closed · tindiz closed this issue 1 year ago

tindiz commented 1 year ago

Hi @qgallouedec,

I have been trying to reproduce the results of some of the experiments, in particular for the PandaPickAndPlace task. However, I was only able to find hyperparameters for v1. Should results be reproducible for v3?

I tried both DDPG and TQC. However, I mostly focused on TQC since it is clearly documented in two places: https://huggingface.co/qgallouedec/tqc-PandaPickAndPlace-v1-3157870761 and https://wandb.ai/openrlbenchmark/sb3.

I can't get anywhere near the results presented in these two sources. I also trained the same agent in a dense-reward environment as a sanity check; the results were quite good, with the success rate going above 90% without any issues.
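For the dense-reward sanity check, the only change was the environment id; panda-gym registers a Dense variant of each task (a minimal sketch, assuming the default registration names):

import gymnasium as gym
import panda_gym  # registers the Panda tasks, including the Dense variants

# Dense-reward variant used only for the sanity check; the rest of the setup is unchanged.
env = gym.make("PandaPickAndPlaceDense-v3")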

To Reproduce

Here is an example of the code I have been trying to run. For your convenience, I removed all callbacks and checkpoints. Also, I am using the bleeding-edge versions of all packages, as described in the docs.

import gymnasium as gym
import panda_gym
from stable_baselines3 import HerReplayBuffer
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")  # sparse-reward version of the task

# TQC + HER, with the hyperparameters from the documented v1 run:
model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1
)

model.learn(
    total_timesteps=1_500_000,
    progress_bar=True
)
qgallouedec commented 1 year ago

This is very surprising. There are no big changes between v1 and v3: friction is handled better, and that's it. I'll take a look on my end and get back to you.

tindiz commented 1 year ago

> This is very surprising. There are no big changes between v1 and v3: friction is handled better, and that's it. I'll take a look on my end and get back to you.

Sounds good. Let me know if there is anything I can do to help out.

qgallouedec commented 1 year ago

Have you tried to run experiments with rl-zoo3? Can you share your plots?
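For reference, an rl-zoo3 run would look roughly like this (a sketch; exact flags can vary between rl-zoo3 versions, and it assumes the zoo's TQC hyperparameter file has an entry for this env id):

pip install rl_zoo3

# --algo/--env/--seed select the run; -f sets the log folder (check --help for your version)
python -m rl_zoo3.train --algo tqc --env PandaPickAndPlace-v3 --seed 3157870761 -f logs/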

tindiz commented 1 year ago

I haven't tried rl-zoo3. I wanted to train it myself, as shown in the code block.

I don't have plots at the moment but will start logging a new training run now; it might take some time. Unfortunately, I didn't run the previous training with TensorBoard logging. However, I can provide the models saved at the checkpoints if that works as well.
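If it helps, this is roughly how I would check the success rate of a saved checkpoint offline (a sketch; the checkpoint path is a placeholder):

import gymnasium as gym
import panda_gym
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")
model = TQC.load("models/<run>/tqc_panda_pick_and_place_100000_steps", env=env)  # placeholder path

# Roll out a few deterministic episodes and count the successes reported by the env.
n_episodes = 50
successes = 0
for _ in range(n_episodes):
    obs, _ = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
    successes += int(info.get("is_success", False))

print(f"Success rate over {n_episodes} episodes: {successes / n_episodes:.2f}")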

tindiz commented 1 year ago

I just realized that the way I was loading the model from a checkpoint isn't correct and doesn't work properly. This might be the issue. Please give me some time to investigate; I will keep you updated.

Sorry for wasting your time.
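For the record, the resume pattern I should have used looks roughly like this (a sketch with placeholder paths; note that CheckpointCallback only writes the replay buffer to disk if it was created with save_replay_buffer=True):

import gymnasium as gym
import panda_gym
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

# Re-create the agent from the checkpoint (placeholder path) and attach the env.
model = TQC.load("models/<run>/tqc_panda_pick_and_place_500000_steps", env=env)

# Without the saved replay buffer, HER resumes from an empty buffer, which hurts training.
# model.load_replay_buffer("models/<run>/tqc_panda_pick_and_place_replay_buffer_500000_steps")

# Continue training without resetting the timestep counter.
model.learn(total_timesteps=600_000, reset_num_timesteps=False, progress_bar=True)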

tindiz commented 1 year ago

Hi, I am getting back to you with more information. I was not able to replicate the results even when training in a single uninterrupted run (no checkpoint reloading). I am attaching the code, plots, and environment-related information. Please let me know if you need anything else or if you find a bug in my code.

Local Environment

Plots

Success Rate: [plot: rollout/success_rate]

Reward: [plot: rollout/ep_rew_mean]

Episode Length: [plot: rollout/ep_len_mean]

Code (in its entirety)

import gymnasium as gym
import panda_gym
import numpy as np
import datetime

from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

# Create TQC agent:
model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1,
    tensorboard_log='./tensorboard/TQC/',
)

stringified_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
checkpoint_callback = CheckpointCallback(
    save_freq=100_000, 
    save_path=f"./models/{stringified_time}/", 
    name_prefix="tqc_panda_pick_and_place"
)  # Create checkpoint callback.

# Model training: 
model.learn(
    total_timesteps=1_100_000, 
    callback=checkpoint_callback, 
    progress_bar=True
)
model.save("tqc_panda_pick_and_place_final")  # Save final model.

System Information

Colab Experiment

I tried training it in Colab as well; the runtime timed out at around 400k steps. I am attaching the same information for that experiment. The results do not look the same to me, but I could not find any difference in the code. I can share the notebook as well. :)

Plots

Success Rate: [plot: rollout/success_rate, Colab run (~400k steps)]

Code

!pip install panda-gym
!pip install git+https://github.com/DLR-RM/stable-baselines3
!pip install git+https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/
!pip install tqdm
!pip install rich

import gymnasium as gym
import panda_gym
import numpy as np
import datetime
from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TQC

base_path = '<user-specific-after-mounting-drive>'

env = gym.make("PandaPickAndPlace-v3")

model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1000000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1,
    tensorboard_log=f'{base_path}/tensorboard/',
)

stringified_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
checkpoint_callback = CheckpointCallback( 
    save_freq=10_000,
    save_path=f"{base_path}/models/{stringified_time}/", 
    name_prefix="tqc_panda_pick_and_place"
)  # Callback for saving the model

# Model training: 
model.learn(
    total_timesteps=1_000_000,
    callback=checkpoint_callback, 
    progress_bar=True
)
model.save(f"{base_path}/tqc_panda_pick_and_place_final")

System Information

qgallouedec commented 1 year ago

Thanks, I'll take a look and get back to you soon.

benquick123 commented 1 year ago

Hi, has anyone figured this out in the end? I can't reproduce the PickAndPlace results using TQC or SAC with either the Hugging Face hyperparameters or the hyperparameters from the panda-gym paper.

tindiz commented 1 year ago

Hi, I made no progress. I got inconsistent results and never managed to replicate the ones documented.

benquick123 commented 1 year ago

OK, the error was actually on my side. While I still can't reproduce the SAC results, TQC works with the Hugging Face hyperparameters after fixing the bug in my code.
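For anyone comparing against the published run, the reference v1 checkpoint can be pulled from the Hub and its hyperparameters inspected roughly like this (the filename follows the usual SB3 Hub naming and is a guess; check the repo's file list if it differs):

from huggingface_sb3 import load_from_hub
from sb3_contrib import TQC

# Download the documented v1 checkpoint (filename assumed, see the repo's file list).
checkpoint = load_from_hub(
    repo_id="qgallouedec/tqc-PandaPickAndPlace-v1-3157870761",
    filename="tqc-PandaPickAndPlace-v1.zip",
)
model = TQC.load(checkpoint)
print(model.batch_size, model.gamma, model.tau, model.learning_rate)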

zichunxx commented 11 months ago

Hi! Has anyone successfully completed the pick-and-place task with DDPG or SAC? I'm confused about the reason for the failure. What factors could be causing it?