haosulab / ManiSkill2-Learn


ManiSkill2 PPO error at end of training for Excavate-v0: "assert len(self.mpm_particle_q) <= model.max_n_particles" #1

Closed DanielTakeshi closed 1 year ago

DanielTakeshi commented 1 year ago

On an Ubuntu 20.04 machine with RTX 3090 GPUs (each with 24 GB of memory), with ManiSkill2 (and ManiSkill2-Learn) installed as per both READMEs, I am training PPO to get a sense of the task difficulty and to better understand the code. I put this in a bash script:

# Try to make this consistent for different envs.
ENV="Excavate-v0"
LOGDIR="logs/${ENV}_ppo_pn"
ENVCFG="env_cfg.env_name=${ENV}"

python maniskill2_learn/apis/run_rl.py configs/mfrl/ppo/maniskill2_pn.py \
    --work-dir $LOGDIR --gpu-ids 0 \
    --cfg-options $ENVCFG "env_cfg.obs_mode=pointcloud" \
        "env_cfg.n_points=1200" "env_cfg.control_mode=pd_joint_delta_pos" \
        "rollout_cfg.num_procs=10" \
        "eval_cfg.num=100" "eval_cfg.save_traj=False" "eval_cfg.save_video=True" \
        "eval_cfg.num_procs=10"

This is based on the PPO instructions in the README here. The only minor differences are that I format LOGDIR and ENVCFG as variables to make it easier to swap environments, and that I use 10 rollout processes instead of 5 (the machine has 80 CPUs according to htop).
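For example, switching to another environment only requires changing ENV at the top of the script (PickCube-v0 below is just an illustrative choice; everything else stays the same):

```
# Illustrative only: swap the environment by editing these three lines.
ENV="PickCube-v0"
LOGDIR="logs/${ENV}_ppo_pn"
ENVCFG="env_cfg.env_name=${ENV}"
```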

If I put this in ppo.sh and run it via ./ppo.sh on the command line, it seems to train fine, but then I get the following near what seems like the end of training:

Excavate-v0-train - (train_rl.py:371) - INFO - 2022-08-06,17:03:02 - 4860000/5000000(97%) Passed time:20h24m49s ETA:35m16s samples_stats: rewards:0.0[0.0, 0.0], max_single_R:0.00[0.00, 0.00], lens:250[250, 250], success:0.00 gpu_mem_ratio: 84.1% gpu_mem: 20.19G gpu_mem_this: 19.88G gpu_util: 84% old_log_p: -5.861 adv_mean: 0 adv_std: 1.011e-08 max_normed_adv: 0.180 v_target: -7.909e-08 ori_returns: 0 critic_err: 0 policy_std: 0.731 entropy: 7.636 mean_p_ratio: 1.000 max_p_ratio: 1.103 log_p: -5.864 clip_frac: 1.292e-03 approx_kl: 1.013e-04 actor_loss: 1.702e-05 entropy_loss: 0 grad_norm: 2.194e-03 visual_grad: 1.956e-04 actor_mlp_grad: 1.002e-03 critic_mlp_grad: 5.651e-06 clipped_grad_norm: 2.194e-03 max_policy_abs: 1.411 policy_norm: 40.214 max_critic_abs: 1.411 critic_norm: 37.319 num_actor_epoch: 2.000 episode_time: 301.416 collect_sample_time: 285.566 memory: 15.93G
Excavate-v0-train - (rollout.py:117) - INFO - 2022-08-06,17:07:51 - Finish with 20000 samples, simulation time/FPS:282.31/70.85, agent time/FPS:3.85/5195.03, overhead time:2.29
Excavate-v0-train - (ppo.py:364) - INFO - 2022-08-06,17:07:51 - Number of batches in one PPO epoch: 61!
Excavate-v0-train - (train_rl.py:371) - INFO - 2022-08-06,17:08:07 - 4880000/5000000(98%) Passed time:20h29m54s ETA:30m14s samples_stats: rewards:0.0[0.0, 0.0], max_single_R:0.00[0.00, 0.00], lens:250[250, 250], success:0.00 gpu_mem_ratio: 84.1% gpu_mem: 20.19G gpu_mem_this: 19.88G gpu_util: 85% old_log_p: -5.867 adv_mean: 0 adv_std: 1.027e-08 max_normed_adv: 0.929 v_target: 1.026e-06 ori_returns: 0 critic_err: 0 policy_std: 0.731 entropy: 7.636 mean_p_ratio: 1.000 max_p_ratio: 1.194 log_p: -5.868 clip_frac: 2.757e-03 approx_kl: 3.911e-04 actor_loss: 1.746e-05 entropy_loss: 0 grad_norm: 4.998e-03 visual_grad: 3.009e-04 actor_mlp_grad: 1.807e-03 critic_mlp_grad: 6.374e-06 clipped_grad_norm: 4.998e-03 max_policy_abs: 1.410 policy_norm: 40.215 max_critic_abs: 1.410 critic_norm: 37.320 num_actor_epoch: 2.000 episode_time: 304.539 collect_sample_time: 288.669 memory: 15.93G
Excavate-v0-train - (rollout.py:117) - INFO - 2022-08-06,17:12:55 - Finish with 20000 samples, simulation time/FPS:281.38/71.08, agent time/FPS:3.88/5159.96, overhead time:2.22
Excavate-v0-train - (ppo.py:364) - INFO - 2022-08-06,17:12:55 - Number of batches in one PPO epoch: 61!
Excavate-v0-train - (train_rl.py:371) - INFO - 2022-08-06,17:13:11 - 4900000/5000000(98%) Passed time:20h34m58s ETA:25m12s samples_stats: rewards:0.0[0.0, 0.0], max_single_R:0.00[0.00, 0.00], lens:250[250, 250], success:0.00 gpu_mem_ratio: 84.1% gpu_mem: 20.19G gpu_mem_this: 19.88G gpu_util: 28% old_log_p: -5.832 adv_mean: 0 adv_std: 1.015e-08 max_normed_adv: 0.285 v_target: 1.071e-06 ori_returns: 0 critic_err: 0 policy_std: 0.730 entropy: 7.633 mean_p_ratio: 1.000 max_p_ratio: 1.253 log_p: -5.832 clip_frac: 4.794e-03 approx_kl: 8.790e-04 actor_loss: -1.949e-05 entropy_loss: 0 grad_norm: 2.688e-03 visual_grad: 1.463e-04 actor_mlp_grad: 9.339e-04 critic_mlp_grad: 4.599e-06 clipped_grad_norm: 2.688e-03 max_policy_abs: 1.410 policy_norm: 40.222 max_critic_abs: 1.410 critic_norm: 37.326 num_actor_epoch: 2.000 episode_time: 303.596 collect_sample_time: 287.702 memory: 15.93G
Process Worker-16:
Traceback (most recent call last):
  File "/home/seita/miniconda3/envs/mani_skill2/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/seita/ManiSkill2/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 132, in run
    ret = getattr(func, func_name)(*args, **kwargs)
  File "/home/seita/ManiSkill2/ManiSkill2-Learn/maniskill2_learn/env/wrappers.py", line 37, in reset
    self.reset_buffer.assign_all(self.env.reset(*args, **kwargs))
  File "/home/seita/ManiSkill2/ManiSkill2-Learn/maniskill2_learn/env/wrappers.py", line 84, in reset
    obs = self.env.reset(*args, **kwargs)
  File "/home/seita/miniconda3/envs/mani_skill2/lib/python3.8/site-packages/gym/wrappers/time_limit.py", line 27, in reset
    return self.env.reset(**kwargs)
  File "/home/seita/ManiSkill2/ManiSkill2-Learn/maniskill2_learn/env/wrappers.py", line 198, in reset
    obs = self.env.reset(**kwargs)
  File "/home/seita/ManiSkill2/ManiSkill2-Learn/maniskill2_learn/env/wrappers.py", line 402, in reset
    obs = super().reset(**kwargs)
  File "/home/seita/miniconda3/envs/mani_skill2/lib/python3.8/site-packages/gym/core.py", line 251, in reset
    return self.env.reset(**kwargs)
  File "/home/seita/ManiSkill2/mani_skill2/envs/mpm/excavate_env.py", line 180, in reset
    ret = super().reset(seed=seed, reconfigure=reconfigure)
  File "/home/seita/ManiSkill2/mani_skill2/envs/mpm/base_env.py", line 296, in reset
    return super().reset(*args, **kwargs)
  File "/home/seita/ManiSkill2/mani_skill2/envs/sapien_env.py", line 340, in reset
    self.initialize_episode()
  File "/home/seita/ManiSkill2/mani_skill2/envs/mpm/base_env.py", line 300, in initialize_episode
    self._initialize_mpm()
  File "/home/seita/ManiSkill2/mani_skill2/envs/mpm/excavate_env.py", line 81, in _initialize_mpm
    self.model_builder.init_model_state(self.mpm_model, self.mpm_states[0])
  File "/home/seita/ManiSkill2/warp_maniskill/mpm/mpm_model.py", line 513, in init_model_state
    assert len(self.mpm_particle_q) <= model.max_n_particles
AssertionError

Also here is the log directory that has been created from this:

(mani_skill2) seita@lambda-dual2:~/ManiSkill2/ManiSkill2-Learn (main) $ ls -lh logs/Excavate-v0_ppo_pn/*
-rw-rw-r-- 1 seita seita 290K Aug  6 17:13 logs/Excavate-v0_ppo_pn/20220805_203734-train.log
-rw-rw-r-- 1 seita seita 2.5K Aug  5 20:37 logs/Excavate-v0_ppo_pn/20220805_203734-train.py

logs/Excavate-v0_ppo_pn/models:
total 24M
-rw-rw-r-- 1 seita seita 5.9M Aug  6 00:49 model_1000000.ckpt
-rw-rw-r-- 1 seita seita 5.9M Aug  6 05:01 model_2000000.ckpt
-rw-rw-r-- 1 seita seita 5.9M Aug  6 09:13 model_3000000.ckpt
-rw-rw-r-- 1 seita seita 5.9M Aug  6 13:26 model_4000000.ckpt

logs/Excavate-v0_ppo_pn/tf_logs:
total 616K
-rw-rw-r-- 1 seita seita 611K Aug  6 17:13 events.out.tfevents.1659746292.lambda-dual2.3950.0
(mani_skill2) seita@lambda-dual2:~/ManiSkill2/ManiSkill2-Learn (main) $ 
Here is the output of the train.py file which shows training details:

```
(mani_skill2) seita@lambda-dual2:~/ManiSkill2/ManiSkill2-Learn (main) $ cat logs/Excavate-v0_ppo_pn/20220805_203734-train.py
agent_cfg = dict(
    type='PPO',
    gamma=0.95,
    lmbda=0.95,
    critic_coeff=0.5,
    entropy_coeff=0,
    critic_clip=False,
    obs_norm=False,
    rew_norm=True,
    adv_norm=True,
    recompute_value=True,
    num_epoch=2,
    critic_warmup_epoch=4,
    batch_size=330,
    detach_actor_feature=False,
    max_grad_norm=0.5,
    eps_clip=0.2,
    max_kl=0.2,
    dual_clip=None,
    shared_backbone=True,
    ignore_dones=True,
    actor_cfg=dict(
        type='ContinuousActor',
        head_cfg=dict(
            type='GaussianHead',
            init_log_std=-1,
            clip_return=True,
            predict_std=False),
        nn_cfg=dict(
            type='Visuomotor',
            visual_nn_cfg=dict(
                type='PointNet',
                feat_dim='pcd_all_channel',
                mlp_spec=[64, 128, 512],
                feature_transform=[]),
            mlp_cfg=dict(
                type='LinearMLP',
                norm_cfg=None,
                mlp_spec=['512 + agent_shape', 256, 256, 'action_shape'],
                inactivated_output=True,
                zero_init_output=True)),
        optim_cfg=dict(
            type='Adam',
            lr=0.0003,
            param_cfg=dict({'(.*?)visual_nn(.*?)': None}))),
    critic_cfg=dict(
        type='ContinuousCritic',
        nn_cfg=dict(
            type='Visuomotor',
            visual_nn_cfg=None,
            mlp_cfg=dict(
                type='LinearMLP',
                norm_cfg=None,
                mlp_spec=['512 + agent_shape', 256, 256, 1],
                inactivated_output=True,
                zero_init_output=True)),
        optim_cfg=dict(type='Adam', lr=0.0003)))
train_cfg = dict(
    on_policy=True,
    total_steps=5000000,
    warm_steps=0,
    n_steps=20000,
    n_updates=1,
    n_eval=5000000,
    n_checkpoint=1000000,
    ep_stats_cfg=dict(info_keys_mode=dict(success=[True, 'max', 'mean'])))
env_cfg = dict(
    type='gym',
    env_name='Excavate-v0',
    obs_mode='pointcloud',
    ignore_dones=True,
    n_points=1200,
    control_mode='pd_joint_delta_pos')
replay_cfg = dict(type='ReplayMemory', capacity=20000)
rollout_cfg = dict(
    type='Rollout',
    num_procs=10,
    with_info=True,
    multi_thread=False)
eval_cfg = dict(
    type='Evaluation',
    num_procs=10,
    num=100,
    use_hidden_state=False,
    save_traj=False,
    save_video=True,
    log_every_step=False,
    env_cfg=dict(ignore_dones=False))
work_dir = None
resume_from = None
expert_replay_cfg = None
recent_traj_replay_cfg = None
(mani_skill2) seita@lambda-dual2:~/ManiSkill2/ManiSkill2-Learn (main) $
```

I have successfully trained PPO from scratch for PickCube-v0, PegInsertionSide-v0, PlugCharger-v0, and StackCube-v0, so I don't know if this is specific to the soft-body environments. Also, it seems to have happened near the end of training (I think 5M steps is the default), so it might be hard to reproduce. But just to ask: are there known ways to counter this (or is this a known issue with this environment)?

fbxiang commented 1 year ago

Hi Daniel, I believe this is due to an oversight in the excavate env. This env sets a completely unnecessary particle cap at excavate.py line 31. You can delete the max_particles argument from both the constructor's parameter list and the super().__init__ call so that the default particle count (65536) is used. This should resolve the issue. I will try to patch it together with the release of the other environments.
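For context, the failing assertion compares the number of particles sampled for an episode against a buffer size that is fixed when the MPM model is built; presumably the per-reset particle count varies, so a lowered cap only trips on whichever reset happens to sample more particles than the cap. A minimal sketch of that shape, with made-up names (this is not the actual warp_maniskill code):

```
# Illustrative sketch only -- hypothetical names, not the actual warp_maniskill code.
import numpy as np

class ToyMPMModel:
    def __init__(self, max_n_particles):
        # Buffer capacity fixed once, when the model is built.
        self.max_n_particles = max_n_particles

class ToyModelBuilder:
    def sample_particles(self, rng):
        # Each reset may sample a different particle count
        # (e.g. from a randomized initial heap of material).
        n = int(rng.integers(60_000, 70_000))
        self.mpm_particle_q = np.zeros((n, 3))

    def init_model_state(self, model):
        # The invariant that failed in the traceback above.
        assert len(self.mpm_particle_q) <= model.max_n_particles

rng = np.random.default_rng(0)
builder, model = ToyModelBuilder(), ToyMPMModel(max_n_particles=65536)
for episode in range(100):
    builder.sample_particles(rng)     # fine while n <= 65536 ...
    builder.init_model_state(model)   # ... AssertionError on the first reset that exceeds the cap
```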

DanielTakeshi commented 1 year ago

Thanks @fbxiang, including that patch would be helpful.

xuanlinli17 commented 1 year ago

BTW, the MPM environments use sparse rewards by default, even though dense rewards have been implemented. Please pass in env_cfg.reward_mode=dense for training.
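For example, with the command from the original post, that just means adding it to --cfg-options alongside the other env_cfg keys (illustrative; the remaining options are unchanged):

```
python maniskill2_learn/apis/run_rl.py configs/mfrl/ppo/maniskill2_pn.py \
    --work-dir $LOGDIR --gpu-ids 0 \
    --cfg-options $ENVCFG "env_cfg.reward_mode=dense" \
        "env_cfg.obs_mode=pointcloud" "env_cfg.n_points=1200" \
        "env_cfg.control_mode=pd_joint_delta_pos" "rollout_cfg.num_procs=10" \
        "eval_cfg.num=100" "eval_cfg.save_traj=False" "eval_cfg.save_video=True" \
        "eval_cfg.num_procs=10"
```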

DanielTakeshi commented 1 year ago

Thanks for the fast responses @fbxiang and @xuanlinli17 !

I see that the dense reward is used by default as of this pull request: https://github.com/haosulab/ManiSkill2/pull/5

Also, to clarify @fbxiang's suggestion for anyone reading this: until it gets patched, you want to delete both lines 31 and 38 here:

https://github.com/haosulab/ManiSkill2/blob/5927330d822df9083ef0a98c21022691a425a30e/mani_skill2/envs/mpm/excavate_env.py#L24-L40
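Roughly, the shape of the edit is as below. This is paraphrased with a placeholder default, not copied from the source; see the permalink above for the exact lines. Removing the two marked lines lets the default particle count (65536) from the MPM base env take effect.

```
     def __init__(
         self,
         *args,
-        max_particles=...,              # (line 31) delete this parameter
         **kwargs,
     ):
         super().__init__(
             *args,
-            max_particles=max_particles,  # (line 38) delete this argument too
             **kwargs,
         )
```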

Jiayuan-Gu commented 1 year ago

The issue should be fixed in v0.2.0 by https://github.com/haosulab/ManiSkill2/pull/12