haosulab / ManiSkill2-Learn


Value broadcast error #23

Closed. lakshitadodeja closed this issue 3 months ago.

lakshitadodeja commented 11 months ago

Hi, I am trying to train an agent with DAPG+PPO on the TurnFaucet environment using RGB-D observations. I used the demonstration conversion script to convert the demonstrations into the required format. I am training on a single GPU with the following command:

python maniskill2_learn/apis/run_rl.py configs/mfrl/dapg/maniskill2_rgbd.py \
            --work-dir ./logs/dapg_turnfaucet_rgbd --gpu-ids 0 \
            --cfg-options "env_cfg.env_name=TurnFaucet-v0" "env_cfg.obs_mode=rgbd" \
            "env_cfg.control_mode=pd_ee_delta_pose" \
            "rollout_cfg.num_procs=12" "env_cfg.reward_mode=dense" \
            "agent_cfg.demo_replay_cfg.buffer_filenames=../ManiSkill2/demos/v0/rigid_body/TurnFaucet-v0/trajectory_merged.none.pd_ee_delta_pose_rgbd.h5" \
            "eval_cfg.num=100" "eval_cfg.save_traj=False" "eval_cfg.save_video=True" \
            "train_cfg.total_steps=50000000" "train_cfg.n_checkpoint=5000000"``python maniskill2_learn/apis/run_rl.py configs/mfrl/dapg/maniskill2_rgbd.py \
            --work-dir ./logs/dapg_turnfaucet_rgbd --gpu-ids 0 \
            --cfg-options "env_cfg.env_name=TurnFaucet-v0" "env_cfg.obs_mode=rgbd" \
            "env_cfg.control_mode=pd_ee_delta_pose" \
            "rollout_cfg.num_procs=12" "env_cfg.reward_mode=dense" \
            "agent_cfg.demo_replay_cfg.buffer_filenames=../ManiSkill2/demos/v0/rigid_body/TurnFaucet-v0/trajectory_merged.none.pd_ee_delta_pose_rgbd.h5" \
            "eval_cfg.num=100" "eval_cfg.save_traj=False" "eval_cfg.save_video=True" \
            "train_cfg.total_steps=50000000" "train_cfg.n_checkpoint=5000000"

I am getting the following error during training:

Process Worker-5:
Traceback (most recent call last):
  File "/home/dodeja/miniconda3/envs/mani_skill2/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/dodeja/Documents/ilrl/ManiSkill2-Learn/maniskill2_learn/utils/meta/parallel_runner.py", line 132, in run
    ret = getattr(func, func_name)(*args, **kwargs)
  File "/home/dodeja/Documents/ilrl/ManiSkill2-Learn/maniskill2_learn/env/wrappers.py", line 46, in reset
    self.reset_buffer.assign_all(self.env.reset(*args, **kwargs))
  File "/home/dodeja/Documents/ilrl/ManiSkill2-Learn/maniskill2_learn/utils/data/dict_array.py", line 638, in assign_all
    self.memory = self._assign(self.memory, slice(None, None, None), value)
  File "/home/dodeja/Documents/ilrl/ManiSkill2-Learn/maniskill2_learn/utils/data/dict_array.py", line 471, in _assign
    memory[key] = cls._assign(memory[key], indices, value[key], ignore_list)
  File "/home/dodeja/Documents/ilrl/ManiSkill2-Learn/maniskill2_learn/utils/data/dict_array.py", line 474, in _assign
    if share_memory(memory, value):
  File "/home/dodeja/Documents/ilrl/ManiSkill2-Learn/maniskill2_learn/utils/data/array_ops.py", line 296, in share_memory
    ret = x.base is not None and y.base is not None and x.base == y.base
ValueError: operands could not be broadcast together with shapes (12,6,128,128) (128,128,6) 
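
For what it's worth, the ValueError comes from the elementwise `==` that share_memory in array_ops.py performs on the two .base arrays: NumPy cannot broadcast a (12, 6, 128, 128) array against a (128, 128, 6) one. A minimal sketch in plain NumPy (independent of ManiSkill2-Learn, using the shapes from the traceback) that reproduces the same failure on a recent NumPy:

import numpy as np

# Shapes copied from the traceback: the preallocated reset buffer appears to
# hold a batch of channel-first images (num_procs, C, H, W), while the value
# being assigned looks like a single channel-last image (H, W, C).
buffer_base = np.zeros((12, 6, 128, 128))
value_base = np.zeros((128, 128, 6))

# share_memory() compares the two .base arrays with `==`; NumPy evaluates this
# elementwise, and because the shapes cannot be broadcast together it raises
# the same error (older NumPy versions only emit a warning instead).
try:
    _ = buffer_base == value_base
except ValueError as err:
    print(err)  # operands could not be broadcast together with shapes ...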

I am assuming this happens because the number of processes is set to 12. Do I have to make any manual changes to get this working? (Also, when I try the same run with a single process, the memory overflows.)

xuanlinli17 commented 11 months ago

Sorry for the late reply. I wasn't able to reproduce the error. Did you follow the demonstration generation script in https://github.com/haosulab/ManiSkill2-Learn/blob/main/scripts/example_demo_conversion/general_rigid_body_multi_object_envs.sh ?

Also, which versions of ManiSkill2 and ManiSkill2-Learn are you using?
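
If it helps, one quick way to sanity-check the converted file is to dump the shape of every dataset in the .h5 and confirm the image layout matches what the RGB-D config expects. The h5py calls below are standard, but the internal group layout is simply whatever the conversion produced, so treat this as a rough sketch:

import h5py

# Path taken from the training command above.
path = "../ManiSkill2/demos/v0/rigid_body/TurnFaucet-v0/trajectory_merged.none.pd_ee_delta_pose_rgbd.h5"

def print_shapes(name, obj):
    # Print the shape and dtype of every dataset, e.g. the per-step RGB-D images.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File(path, "r") as f:
    f.visititems(print_shapes)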