haosulab / ManiSkill2-Learn


Reproducing DAPG+PPO baseline results with RGBD input on Pick/StackCube and PickSingleYCB envs #10

Closed Zed-Wu closed 1 year ago

Zed-Wu commented 1 year ago

I tried to reproduce the DAPG+PPO baseline results with RGBD input on the PickCube, StackCube, and PickSingleYCB envs, but the success rates are all nearly 0.00. Here are the commands I used.

For PickCube:

python maniskill2_learn/apis/run_rl.py configs/mfrl/dapg/maniskill2_rgbd.py \
            --work-dir xxx --gpu-ids 0 --sim-gpu-ids 0 \
            --cfg-options "env_cfg.env_name=PickCube-v0" "env_cfg.obs_mode=rgbd" \
            "env_cfg.control_mode=pd_ee_delta_pose" \
            "rollout_cfg.num_procs=12" "env_cfg.reward_mode=dense" \
            "agent_cfg.demo_replay_cfg.buffer_filenames=../ManiSkill2/demos/rigid_body/PickCube-v0/trajectory.none.pd_ee_delta_pose_rgbd.h5" \
            "eval_cfg.num=100" "eval_cfg.save_traj=True" "eval_cfg.save_video=True" \
            "train_cfg.total_steps=20000000" "train_cfg.n_checkpoint=1000000" "train_cfg.n_eval=1000000"

For StackCube:

python maniskill2_learn/apis/run_rl.py configs/mfrl/dapg/maniskill2_rgbd.py \
            --work-dir xxx --gpu-ids 0 --sim-gpu-ids 0 \
            --cfg-options "env_cfg.env_name=StackCube-v0" "env_cfg.obs_mode=rgbd" \
            "env_cfg.control_mode=pd_ee_delta_pose" \
            "rollout_cfg.num_procs=3" "env_cfg.reward_mode=dense" \
            "agent_cfg.demo_replay_cfg.buffer_filenames=../ManiSkill2/demos/rigid_body/StackCube-v0/trajectory.none.pd_ee_delta_pose_rgbd.h5" \
            "eval_cfg.num=100" "eval_cfg.save_traj=True" "eval_cfg.save_video=True" \
            "train_cfg.total_steps=20000000" "train_cfg.n_checkpoint=1000000" "train_cfg.n_eval=1000000"

For PickSingleYCB:

python maniskill2_learn/apis/run_rl.py configs/mfrl/dapg/maniskill2_rgbd.py \
            --work-dir xxx --gpu-ids 0 --sim-gpu-ids 0 \
            --cfg-options "env_cfg.env_name=PickSingleYCB-v0" "env_cfg.obs_mode=rgbd" \
            "rollout_cfg.num_procs=6" "env_cfg.reward_mode=dense" \
            "env_cfg.control_mode=pd_ee_delta_pose" \
            "agent_cfg.demo_replay_cfg.capacity=20000" "agent_cfg.demo_replay_cfg.cache_size=20000" \
            "agent_cfg.demo_replay_cfg.dynamic_loading=True" "agent_cfg.demo_replay_cfg.num_samples=-1" \
            "agent_cfg.demo_replay_cfg.buffer_filenames=../ManiSkill2/demos/rigid_body/PickSingleYCB-v0/trajectory_merged.none.pd_ee_delta_pose_rgbd.h5" \
            "eval_cfg.num=100" "eval_cfg.save_traj=True" "eval_cfg.save_video=True" \
            "train_cfg.total_steps=20000000" "train_cfg.n_checkpoint=1000000" "train_cfg.n_eval=1000000"

I generated all the data using the scripts in ManiSkill2-Learn/scripts/example_demo_conversion and did not change them. I uploaded the training logs here: https://1drv.ms/f/s!AvKmwUwmh8xhjbE3GNujBLObgtdfNg?e=NI54UP and hope they help.

We also tried collecting around 1000 new successful demonstrations with your released checkpoint and training on them with BC. The final results are much higher than those reported in the paper (e.g., 0.01 and 0.00 for PickCube and StackCube using BC with RGBD input). So we suspect the originally released demos may have some bugs, and we hope this information helps.
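
Roughly, the collection boils down to something like the sketch below (a simplified illustration, not our exact script; the policy here is just a placeholder for the one loaded from your released checkpoint, and only rollouts with info["success"] are kept):

import gym
import mani_skill2.envs  # noqa: F401  (registers ManiSkill2 environments)

env = gym.make("PickCube-v0", obs_mode="rgbd", control_mode="pd_ee_delta_pose")
# Placeholder policy: replace with the policy loaded from the released checkpoint.
policy = lambda obs: env.action_space.sample()

successful_trajs = []
while len(successful_trajs) < 1000:
    obs, done, traj = env.reset(), False, []
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        traj.append((action, reward))
    if info.get("success", False):  # keep only successful rollouts
        successful_trajs.append(traj)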

xuanlinli17 commented 1 year ago

There were some camera rendering changes between the old ManiSkill2 (0.3.0 and earlier) and the new ManiSkill2 (0.4.0+), which might cause a bit of sample-complexity difference compared to when we ran our baseline experiments in Sep 2022. From what I've experimented with, PointCloud DAPG should achieve results similar to before.

Which ManiSkill2 commit and ManiSkill2-Learn commit are you using? Are they the latest?

Also, can you try training the policy for longer (50M steps)? We also plan to modify the reward scale in 0.5.0 (scale the reward to [0, 1]), which will aid convergence. You can also manually modify the envs and scale the rewards down to [0, 1] (i.e., divide by the max reward).
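
For example, a simple wrapper along these lines would work (sketch only; the max-reward constant is a placeholder that you would look up per task):

import gym
import mani_skill2.envs  # noqa: F401

MAX_DENSE_REWARD = 10.0  # placeholder: use the actual max of the task's dense reward

class ScaledRewardWrapper(gym.RewardWrapper):
    """Divide the dense reward by its task-specific maximum so it lies in [0, 1]."""
    def reward(self, reward):
        return reward / MAX_DENSE_REWARD

env = ScaledRewardWrapper(
    gym.make("PickCube-v0", obs_mode="rgbd", control_mode="pd_ee_delta_pose")
)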

As for why BC on demos generated from our RL checkpoints performs well: those demos are far more consistent and contain far fewer behavior modes than our original demos (which were generated with TAMP, and our large random initialization of object positions makes this issue more significant), so policies learn from RL-generated demos via BC much more easily. If we have an expert RL agent, then BC on its demos requires far fewer trajectories, and the resulting policy's success rate approaches the RL agent's. We have observed the same phenomenon across many different scenarios, including last year's ManiSkill1 challenge.

Zed-Wu commented 1 year ago

I am using the latest ManiSkill2 and ManiSkill2-Learn; I just installed them last week. Thank you very much for your suggestions and for the explanation of the BC results.

xuanlinli17 commented 1 year ago

If you try PPO alone (no DAPG), does it get above 0?

Also, please double-check the demo generation command (are the envs, paths, etc. correct?). For YCB, you might want to use the shuffled demos (trajectory_merged.none.pd_ee_delta_pose_pointcloud_shuffled.h5).
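
If you ever need a shuffled file that is not provided, a rough way to shuffle a merged HDF5 demo file yourself could look like the sketch below (it assumes the file stores one traj_<i> group per trajectory; this is not the exact tool we use):

import h5py
import numpy as np

src_path = "trajectory_merged.none.pd_ee_delta_pose_rgbd.h5"      # input file
dst_path = "trajectory_merged.none.pd_ee_delta_pose_rgbd_shuffled.h5"

with h5py.File(src_path, "r") as src, h5py.File(dst_path, "w") as dst:
    keys = [k for k in src.keys() if k.startswith("traj_")]
    order = np.random.permutation(len(keys))
    for new_idx, old_idx in enumerate(order):
        # copy each trajectory group under a new, shuffled index
        src.copy(src[keys[old_idx]], dst, name=f"traj_{new_idx}")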

Zed-Wu commented 1 year ago

I checked the demo generation commands and they are correct. I will try PPO alone when I have idle GPUs; currently, I am building my own algorithm on top of your benchmark.