Closed zanellar closed 1 year ago
Hi, since the reward is sparse, it is normal that you don't get good results without a structured exploration strategy (like HER). I think the only task that can be learned without any exploration strategy is PandaReach.
Having said that, I draw your attention to the fact that HER can't work with PPO, because PPO is an on-policy algorithm.
You can find good hyperparameters on RL baselines3 Zoo, but only for the TQC algorithm for the moment. By the way, we are currently building an open benchmark for many algorithms and environments, including panda-gym, see openrlbenchmark.
I plan to extend this benchmark on panda-gym to algorithms other than TQC, but it's not my priority for the moment. If you find good hyperparameters for SAC and DDPG, please share them.
I got the SAC implementation from stable-baselines3 working on PandaPush-v3 with both sparse and dense rewards:

- 10 environments in parallel with `max_episode_steps=100`, for both sparse and dense rewards.
- SAC settings: sb3 defaults plus `learning_starts=100000`, `gradient_steps=-1`, and `steps=3000000` (nothing fancy, just letting it run for a while).
Will add these to RL Baselines3 Zoo after more hyperparameter tuning.
This is great news! Did you use SAC with HER? Could you prepare for the openrlbenchmark integration by tracking your experiments with wandb? The instructions are here: https://github.com/openrlbenchmark/openrlbenchmark/issues/7 For now, track them in a personal project and we'll move them to openrlbenchmark afterwards.
For now, I just used the regular experience replay buffer, no HER; I might have to try it. Thanks for the openrlbenchmark hint, I will look into it! Though my current research focuses on vision-based RL, so I'm not really using the default observation/state representation.
> You can find good hyperparameters on RL baselines3 Zoo, but only for the TQC algorithm for the moment. By the way, we are currently building an open benchmark for many algorithms and environments, including panda-gym, see openrlbenchmark.
Did you solve the panda push problem with the action space defined as 3 values (the end-effector position) or using the 7 joint actions? Regarding the joint-based action space, is it position, velocity or torque control?
> Did you solve the panda push problem with the action space defined as 3 values (the end-effector position) or using the 7 joint actions?

We used PandaPush, where the observation and the control are related to the end-effector.
> Regarding the joint-based action space, is it position, velocity or torque control?
The action is a target displacement. First, the raw action is scaled by 0.05; the result is added to the current joint positions to obtain the target joint angles. PyBullet then uses a PD controller to compute the torque applied to each joint. Thus, we can think of the action as a virtual force applied on the joints.
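In other words, the displacement step before the PD controller boils down to the following (an illustrative sketch of the description above, not the actual panda-gym source; the function name is made up):

```python
import numpy as np

ACTION_SCALE = 0.05  # scaling factor quoted above

def target_joint_angles(raw_action: np.ndarray, current_angles: np.ndarray) -> np.ndarray:
    """Turn a raw action in [-1, 1] into target joint angles.

    The action is a displacement: it is scaled by 0.05 and added to the
    current joint positions; PyBullet's PD controller then computes the
    torques that track these target angles.
    """
    raw_action = np.clip(raw_action, -1.0, 1.0)
    return current_angles + ACTION_SCALE * raw_action

# A full-scale action moves each joint target by at most 0.05 rad:
targets = target_joint_angles(np.ones(7), np.zeros(7))
```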
Related: https://github.com/qgallouedec/panda-gym/issues/37#issuecomment-1284237047
I now consider this question solved, as the first version of openrlbenchmark is out. You can see all the results and hyperparameters here: https://wandb.ai/openrlbenchmark/sb3
@qgallouedec Hi! It seems that all PickAndPlace tasks are solved successfully with TQC in https://wandb.ai/openrlbenchmark/sb3, but there are no recommended hyperparameters for DDPG and SAC.
Hi, can you provide some benchmarking results with the corresponding algorithms and hyperparameters for the 4 tasks? I've tried SAC, PPO and DDPG but couldn't train an agent that achieves good results (I'm focusing on PandaPickAndPlace and PandaPush).