This is very surprising. There are no big changes between v1 and v3; the friction is better managed, and that's it. I'll take a look on my end and get back to you.
Sounds good. Let me know if there is anything I can do to help out.
Have you tried to run experiments with rl-zoo3? Can you share your plots?
I haven't tried rl-zoo3. I wanted to train it myself, as shown in the code block.
I don't have plots at the moment but will try to log training now. It might take some time... Unfortunately, I didn't run it with the Tensorboard callback during training. However, I can get the models generated at checkpoints if that works as well.
I just realized that the way I was loading the model from a checkpoint isn't correct and does not work properly. This might be the issue. Please give me some time to investigate, I will keep you updated.
Sorry for wasting your time.
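For context, resuming from an SB3 checkpoint is supposed to look roughly like this (a sketch, assuming the files produced by `CheckpointCallback`; the paths below are placeholders, not the exact names from my run):

```python
import gymnasium as gym
import panda_gym
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

# Load the checkpoint and re-attach the environment (path is a placeholder).
model = TQC.load("./models/tqc_panda_pick_and_place_100000_steps", env=env)

# If the replay buffer was saved as well (CheckpointCallback(..., save_replay_buffer=True)),
# reload it too; otherwise training continues from an empty buffer.
model.load_replay_buffer("./models/tqc_panda_pick_and_place_replay_buffer_100000_steps.pkl")

# Continue training without resetting the timestep counter.
model.learn(total_timesteps=1_000_000, reset_num_timesteps=False, progress_bar=True)
```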
Hi, I am getting back to you with more information. I was not able to replicate results even when training continuously. I am attaching code, plots and environment-related information. Please let me know if you need anything else or if you find a bug in my code.
```python
import gymnasium as gym
import panda_gym
import numpy as np
import datetime

from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")

# Create TQC agent:
model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1,
    tensorboard_log='./tensorboard/TQC/',
)

# Create checkpoint callback:
stringified_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
checkpoint_callback = CheckpointCallback(
    save_freq=100_000,
    save_path=f"./models/{stringified_time}/",
    name_prefix="tqc_panda_pick_and_place",
)

# Model training:
model.learn(
    total_timesteps=1_100_000,
    callback=checkpoint_callback,
    progress_bar=True,
)

model.save("tqc_panda_pick_and_place_final")  # Save final model.
```
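As a side note, this is roughly how the success rate of a saved model can be checked (a rough sketch rather than my exact script: the model path and episode count are placeholders, and it assumes the `is_success` key that panda-gym puts in `info`):

```python
import gymnasium as gym
import panda_gym
from sb3_contrib import TQC

env = gym.make("PandaPickAndPlace-v3")
model = TQC.load("tqc_panda_pick_and_place_final", env=env)  # Path is a placeholder.

n_episodes, successes = 100, 0
for _ in range(n_episodes):
    obs, _ = env.reset()
    done = False
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
    successes += int(info.get("is_success", False))

print(f"Success rate: {successes / n_episodes:.2%}")
```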
I tried training it in Colab as well; the environment timed out at around 400k steps. I am attaching the same information for that experiment. The results do not look the same to me, but I could not find any difference in the code. I can share the notebook as well. :)
```
!pip install panda-gym
!pip install git+https://github.com/DLR-RM/stable-baselines3
!pip install git+https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/
!pip install tqdm
!pip install rich
```
```python
import gymnasium as gym
import panda_gym
import numpy as np
import datetime

from stable_baselines3 import HerReplayBuffer
from stable_baselines3.common.callbacks import CheckpointCallback
from sb3_contrib import TQC

base_path = '<user-specific-after-mounting-drive>'

env = gym.make("PandaPickAndPlace-v3")

model = TQC(
    "MultiInputPolicy",
    env,
    batch_size=2048,
    buffer_size=1_000_000,
    gamma=0.95,
    learning_rate=0.001,
    policy_kwargs=dict(net_arch=[512, 512, 512], n_critics=2),
    replay_buffer_class=HerReplayBuffer,
    replay_buffer_kwargs=dict(goal_selection_strategy='future', n_sampled_goal=4),
    tau=0.05,
    seed=3157870761,
    verbose=1,
    tensorboard_log=f'{base_path}/tensorboard/',
)

# Callback for saving the model:
stringified_time = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
checkpoint_callback = CheckpointCallback(
    save_freq=10_000,
    save_path=f"{base_path}/models/{stringified_time}/",
    name_prefix="tqc_panda_pick_and_place",
)

# Model training:
model.learn(
    total_timesteps=1_000_000,
    callback=checkpoint_callback,
    progress_bar=True,
)

model.save(f"{base_path}/tqc_panda_pick_and_place_final")
```
Thanks, I'll take a look. I'll get back to you soon.
Hi, has anyone figured it out in the end? I can't reproduce PickAndPlace results using TQC or SAC with either the Hugging Face hyperparameters or the hyperparameters from the panda-gym paper.
Hi, I made no progress. I got inconsistent results and never managed to replicate the ones documented.
OK, the error was actually on my side. While I still can't reproduce the SAC results, TQC works with the Hugging Face hyperparameters after fixing the bug in my code.
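For anyone else trying to reproduce this, training through rl-zoo3 (which ships tuned hyperparameters for the panda-gym tasks) is probably the most direct route; roughly something like the following, though the exact flags may differ between versions:

```
python -m rl_zoo3.train --algo tqc --env PandaPickAndPlace-v3 -f logs/
```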
Hi! Has anyone successfully completed the pick-and-place task with DDPG or SAC? I'm confused about the reason for the failure. What possible factors could cause this?
Hi @qgallouedec,
I have been trying to reproduce the results of some of the experiments, in particular for the PandaPickAndPlace task. However, I was only able to find hyperparameters for v1. Should results be reproducible for v3?
I tried both DDPG and TQC. However, I mostly focused on TQC since it is clearly documented in two places: https://huggingface.co/qgallouedec/tqc-PandaPickAndPlace-v1-3157870761 and https://wandb.ai/openrlbenchmark/sb3.
I can't get anywhere near the results presented in these two sources. I also trained the same agent in a dense environment as a sort of sanity check; those results were quite good, with the success rate going above 90% without any issues.
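For the dense sanity check, only the environment id changes, since panda-gym registers a Dense variant of each task (assuming I have the id right):

```python
import gymnasium as gym
import panda_gym

# Dense-reward variant of the same task, used only as a sanity check.
env = gym.make("PandaPickAndPlaceDense-v3")
```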
To Reproduce
Here is an example of the code I have been trying to run. For your convenience, I removed all callbacks and checkpoints. Also, I am using the bleeding-edge versions of all the packages, as described in the docs.