qgallouedec / panda-gym

Set of robotic environments based on PyBullet physics engine and gymnasium.
MIT License

PandaPickAndPlace-v3 Training and Hyperparameters #66

Closed tindiz closed 8 months ago

tindiz commented 1 year ago

Hi @qgallouedec,

I have been trying to reproduce the results of some of the experiments, in particular for the PandaPickAndPlace task. However, I was only able to find hyperparameters for v1. Should results be reproducible for v3?

I tried both DDPG and TQC, but I mostly focused on TQC since it is clearly documented in two places: https://huggingface.co/qgallouedec/tqc-PandaPickAndPlace-v1-3157870761 and https://wandb.ai/openrlbenchmark/sb3.

I can't get anywhere near the results presented in these two sources. As a sanity check, I also trained the same agent in the dense-reward environment; the results were quite good, with the success rate going above 90% without any issues.
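
For the dense sanity check I only swapped the environment id; something like the snippet below, assuming the dense variant is registered as PandaPickAndPlaceDense-v3 (that is my reading of the panda-gym docs):

import gymnasium as gym
import panda_gym  # noqa: F401  (importing it registers the Panda environments)

# Dense-reward variant, used only as a sanity check.
dense_env = gym.make("PandaPickAndPlaceDense-v3")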

To Reproduce

Here is an example of the code I have been trying to run. For your convenience, I removed all callbacks and checkpoints. I am also using the bleeding-edge versions of all the packages, as suggested in the docs.

import gymnasium as gym
import panda_gym
from stable_baselines3 import HerReplayBuffer
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1
)

model.learn(
    total_timesteps=1_500_000,
    progress_bar=True
)
qgallouedec commented 1 year ago

This is very surprising. There are no big changes between v1 and v3. The friction is better managed and that's it. I'll take a look on my end and get back to you.

tindiz commented 1 year ago

> This is very surprising. There are no big changes between v1 and v3. The friction is better managed and that's it. I'll take a look on my end and get back to you.

Sounds good. Let me know if there is anything I can do to help out.

qgallouedec commented 1 year ago

Have you tried running experiments with rl-zoo3? Can you share your plots?
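
For reference, something like this should launch the zoo's training script for TQC (a sketch, assuming rl_zoo3 is installed and your version ships hyperparameters for PandaPickAndPlace-v3; otherwise point it at a custom config file, and double-check the flags against your rl-baselines3-zoo version):

import subprocess

# Launch rl-zoo3's train entry point for TQC on PandaPickAndPlace-v3.
subprocess.run(
    [
        "python", "-m", "rl_zoo3.train",
        "--algo", "tqc",
        "--env", "PandaPickAndPlace-v3",
    ],
    check=True,
)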

tindiz commented 1 year ago

I haven't tried rl-zoo3. I wanted to train it myself, as shown in the code block.

I don't have plots at the moment but will try to log training now. It might take some time... Unfortunately, I didn't run it with the TensorBoard callback during training, but I can provide the models saved at checkpoints if that works as well.

tindiz commented 1 year ago

I just realized that the way I was loading the model from a checkpoint isn't correct and does not work properly. This might be the issue. Please give me some time to investigate, I will keep you updated.

Sorry for wasting your time.
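
In case it helps anyone else, this is roughly how I resume from a checkpoint now. It is a minimal sketch: the checkpoint file names below are examples following CheckpointCallback's naming, and the replay buffer file only exists if the callback was created with save_replay_buffer=True.

import gymnasium as gym
import panda_gym
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

# Re-attach the environment when loading (HER needs env.compute_reward()).
model = TQC.load("./models/tqc_panda_pick_and_place_1000000_steps", env=env)

# Off-policy algorithms also need their replay buffer to resume cleanly.
model.load_replay_buffer("./models/tqc_panda_pick_and_place_replay_buffer_1000000_steps")
# Depending on the SB3 version, the HER buffer may need the env re-attached,
# since it is not serialized with the buffer.
model.replay_buffer.set_env(model.get_env())

# Continue training without resetting the timestep counter.
model.learn(total_timesteps=500_000, reset_num_timesteps=False, progress_bar=True)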

tindiz commented 1 year ago

Hi, I am getting back to you with more information. I was not able to replicate results even when training continuously. I am attaching code, plots and environment-related information. Please let me know if you need anything else or if you find a bug in my code.

Local Environment

Plots

[Attached plots: rollout_success_rate (success rate), rollout_ep_rew_mean (mean episode reward), rollout_ep_len_mean (mean episode length)]

Code (in its entirety)

import gymnasium as gym
import panda_gym
import numpy as np
import datetime

from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

# Create TQC agent:
model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1,
    tensorboard_log='./tensorboard/TQC/',
)

stringified_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
checkpoint_callback = CheckpointCallback(
    save_freq=100_000, 
    save_path=f"./models/{stringified_time}/", 
    name_prefix="tqc_panda_pick_and_place"
)  # Create checkpoint callback.

# Model training: 
model.learn(
    total_timesteps=1_100_000, 
    callback=checkpoint_callback, 
    progress_bar=True
)
model.save("tqc_panda_pick_and_place_final")  # Save final model.

System Information

Colab Experiment

I tried training it in Colab as well; the runtime timed out at around 400k steps. I am attaching the same information for that experiment. The results do not look the same to me, but I could not find any difference in the code. I can share the notebook as well. :)

Plots

[Attached plot: rollout_success_rate_colab (success rate)]

Code

!pip install panda-gym
!pip install git+https://github.com/DLR-RM/stable-baselines3
!pip install git+https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/
!pip install tqdm
!pip install rich

import gymnasium as gym
import panda_gym
import numpy as np
import datetime
from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TQC

base_path = '<user-specific-after-mounting-drive>'

env = gym.make("PandaPickAndPlace-v3")

model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1000000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1,
    tensorboard_log=f'{base_path}/tensorboard/',
)

stringified_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
checkpoint_callback = CheckpointCallback( 
    save_freq=10_000,
    save_path=f"{base_path}/models/{stringified_time}/", 
    name_prefix="tqc_panda_pick_and_place"
)  # Callback for saving the model

# Model training: 
model.learn(
    total_timesteps=1_000_000,
    callback=checkpoint_callback, 
    progress_bar=True
)
model.save(f"{base_path}/tqc_panda_pick_and_place_final")

System Information

qgallouedec commented 1 year ago

Thanks, I'll take a look. I'll get back to you soon.

benquick123 commented 9 months ago

Hi, has anyone figured this out in the end? I can't reproduce the PickAndPlace results using TQC or SAC with either the Hugging Face hyperparameters or the hyperparameters from the panda-gym paper.

tindiz commented 9 months ago

Hi, I made no progress. I got inconsistent results and never managed to replicate the ones documented.

benquick123 commented 9 months ago

Ok, the error was actually on my side. While I still can't reproduce the SAC results, TQC works with the Hugging Face hyperparameters after fixing the bug in my code.

zichunxx commented 7 months ago

Hi! Has anyone successfully completed the pick-and-place task with DDPG or SAC? I'm confused about the reason for the failure. What factors could possibly cause it?
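
For reference, this is the kind of setup I have been trying for DDPG; the hyperparameters are my own guesses carried over from the TQC runs above, not tuned values from the zoo or the paper.

import gymnasium as gym
import panda_gym
from stable_baselines3 import DDPG, HerReplayBuffer

env = gym.make("PandaPickAndPlace-v3")

# DDPG with HER, mirroring the TQC configuration posted earlier in this thread.
model = DDPG(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy="future", n_sampled_goal=4),
    verbose=1,
)
model.learn(total_timesteps=1_000_000, progress_bar=True)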