isaac-sim / IsaacGymEnvs

Isaac Gym Reinforcement Learning Environments

IndustRealSim training time and inference performance #167

Closed wredsen closed 10 months ago

wredsen commented 1 year ago

First of all: Thanks for your very interesting paper about Sim2Real of Contact-Rich Assembly and making the related code public!

I just trained the policy from scratch for the task IndustRealTaskPegsInsert running:

python train.py task=IndustRealTaskPegsInsert headless=True

Unlike the roughly 10 hours mentioned in the docs, this took about 50 hours on a modern PC with an RTX 4080.

I then evaluated inference with rendering by running:

python train.py task=IndustRealTaskPegsInsert checkpoint=runs/IndustRealTaskPegsInsert_26-17-11-38/nn/last_IndustRealTaskPegsInsert_ep_8192_rew__12502.09_.pth test=True num_envs=8

I additionally set num_envs=8 for performance reasons. It seems the insertion policy does not really work yet; please find attached one close-up and one more distant video of those inference runs.

Best regards, Konstantin

IndustReal_PiH.webm IndustReal_PiH_close.webm

bingjietang718 commented 1 year ago

Hi Konstantin,

Thank you so much for your interest in IndustReal!

  • Training time: we expect the policy to achieve a success rate similar to the one reported in the paper within 8-10 hours for the peg insertion policy. However, we did not terminate the policy training within this timeframe.
  • Can you please try to replace self.ctrl_target_fingertip_centered_pos with self.ctrl_target_fingertip_midpoint_pos in this line, and replace self.ctrl_target_fingertip_centered_quat with self.ctrl_target_fingertip_midpoint_quat in this line?

Please feel free to let us know if you have any other questions.

Best, Bingjie

ismarou commented 1 year ago

Hi Bingjie,

Indeed, the problem that Konstantin mentions happened to us as well.

Before the fix you suggested, we got a 0% insertion success rate (no successes at all during training), and after it about 26% (modern GPU, about 10 hours of training).

Is this supposed to happen, or is there some discrepancy?

Best, Isidoros

P.S. When the results are logged in WandB, you measure insertion_successes/rewards over time/frame/iter, but your reported WandB logs all share the same time scale. Shouldn't they all be separate, or am I missing something here?
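For example, I would have expected each metric to get its own x-axis, along the lines of the following sketch (plain wandb API with a made-up project name and dummy values, outside of the rl_games integration):

    import wandb

    run = wandb.init(project="industreal-debug")  # placeholder project name

    # Register custom x-axes so each curve is plotted against its own counter
    # instead of the default global step.
    wandb.define_metric("iter")
    wandb.define_metric("frame")
    wandb.define_metric("insertion_successes_vs_iter", step_metric="iter")
    wandb.define_metric("insertion_successes_vs_frame", step_metric="frame")

    # Log both counters together with the metric values (dummy numbers).
    wandb.log({"iter": 100,
               "frame": 100 * 128,
               "insertion_successes_vs_iter": 0.26,
               "insertion_successes_vs_frame": 0.26})
    run.finish()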

bingjietang718 commented 1 year ago

Hi Isidoros,

Can you revert my previously suggested fix (i.e. change back to _fingertip_centered_) and add the following code blocks here?

    def generate_ctrl_signals(self):
        """Get Jacobian. Set Franka DOF position targets or DOF torques."""
        # Get desired Jacobian
        if self.cfg_ctrl['jacobian_type'] == 'geometric':
            self.fingertip_midpoint_jacobian_tf = self.fingertip_centered_jacobian
        elif self.cfg_ctrl['jacobian_type'] == 'analytic':
            self.fingertip_midpoint_jacobian_tf = fc.get_analytic_jacobian(
                fingertip_quat=self.fingertip_quat,
                fingertip_jacobian=self.fingertip_centered_jacobian,
                num_envs=self.num_envs,
                device=self.device)
        # Set PD joint pos target or joint torque
        if self.cfg_ctrl['motor_ctrl_mode'] == 'gym':
            self._set_dof_pos_target()
        elif self.cfg_ctrl['motor_ctrl_mode'] == 'manual':
            self._set_dof_torque()

    def _set_dof_pos_target(self):
        """Set Franka DOF position target to move fingertips towards target pose."""
        self.ctrl_target_dof_pos = fc.compute_dof_pos_target(
            cfg_ctrl=self.cfg_ctrl,
            arm_dof_pos=self.arm_dof_pos,
            fingertip_midpoint_pos=self.fingertip_centered_pos,
            fingertip_midpoint_quat=self.fingertip_centered_quat,
            jacobian=self.fingertip_midpoint_jacobian_tf,
            ctrl_target_fingertip_midpoint_pos=self.ctrl_target_fingertip_centered_pos,
            ctrl_target_fingertip_midpoint_quat=self.ctrl_target_fingertip_centered_quat,
            ctrl_target_gripper_dof_pos=self.ctrl_target_gripper_dof_pos,
            device=self.device)
        self.gym.set_dof_position_target_tensor_indexed(self.sim,
                                                        gymtorch.unwrap_tensor(self.ctrl_target_dof_pos),
                                                        gymtorch.unwrap_tensor(self.franka_actor_ids_sim),
                                                        len(self.franka_actor_ids_sim))

    def _set_dof_torque(self):
        """Set Franka DOF torque to move fingertips towards target pose."""
        self.dof_torque = fc.compute_dof_torque(
            cfg_ctrl=self.cfg_ctrl,
            dof_pos=self.dof_pos,
            dof_vel=self.dof_vel,
            fingertip_midpoint_pos=self.fingertip_centered_pos,
            fingertip_midpoint_quat=self.fingertip_centered_quat,
            fingertip_midpoint_linvel=self.fingertip_centered_linvel,
            fingertip_midpoint_angvel=self.fingertip_centered_angvel,
            left_finger_force=self.left_finger_force,
            right_finger_force=self.right_finger_force,
            jacobian=self.fingertip_midpoint_jacobian_tf,
            arm_mass_matrix=self.arm_mass_matrix,
            ctrl_target_gripper_dof_pos=self.ctrl_target_gripper_dof_pos,
            ctrl_target_fingertip_midpoint_pos=self.ctrl_target_fingertip_centered_pos,
            ctrl_target_fingertip_midpoint_quat=self.ctrl_target_fingertip_centered_quat,
            ctrl_target_fingertip_contact_wrench=self.ctrl_target_fingertip_contact_wrench,
            device=self.device)
        self.gym.set_dof_actuation_force_tensor_indexed(self.sim,
                                                        gymtorch.unwrap_tensor(self.dof_torque),
                                                        gymtorch.unwrap_tensor(self.franka_actor_ids_sim),
                                                        len(self.franka_actor_ids_sim))

The difference is that the robot end-effector control target is now calculated under the assumption that the gripper fingers are perfectly mirrored, which aligns with the real-world condition.
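For intuition, the two frames differ roughly as in the sketch below (a simplified, self-contained illustration with dummy values, not the exact implementation in the codebase):

    import torch

    def translate_along_local_z(pos, quat, offset):
        """Translate a pose along its local z-axis by `offset` (quat in x, y, z, w order)."""
        x, y, z, w = quat.unbind(-1)
        z_axis = torch.stack((2.0 * (x * z + w * y),
                              2.0 * (y * z - w * x),
                              1.0 - 2.0 * (x * x + y * y)), dim=-1)
        return pos + offset * z_axis

    # Dummy values standing in for the simulator's hand and fingertip states.
    hand_pos = torch.tensor([[0.5, 0.0, 0.4]])
    hand_quat = torch.tensor([[0.0, 0.0, 0.0, 1.0]])          # identity rotation
    left_fingertip_pos = torch.tensor([[0.5, 0.04, 0.45]])
    right_fingertip_pos = torch.tensor([[0.5, -0.03, 0.45]])  # slightly asymmetric
    finger_length = 0.054

    # "Centered" frame: hand pose translated along its local z-axis by the finger
    # length, i.e. it assumes the fingers are perfectly mirrored about the hand.
    fingertip_centered_pos = translate_along_local_z(hand_pos, hand_quat, finger_length)

    # "Midpoint" frame: average of the two measured fingertip positions, which
    # drifts off-center whenever the fingers are not symmetric.
    fingertip_midpoint_pos = 0.5 * (left_fingertip_pos + right_fingertip_pos)

    print(fingertip_centered_pos, fingertip_midpoint_pos)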

Bingjie

wredsen commented 1 year ago

Hi Konstantin,

Thank you so much for your interest in IndustReal!

  • Training time: we expect the policy to achieve a success rate similar to the one reported in the paper within 8-10 hours for the peg insertion policy. However, we did not terminate the policy training within this timeframe.
  • Can you please try to replace self.ctrl_target_fingertip_centered_pos with self.ctrl_target_fingertip_midpoint_pos in this line, and replace self.ctrl_target_fingertip_centered_quat with self.ctrl_target_fingertip_midpoint_quat in this line?

Please feel free to let us know if you have any other questions.

Best, Bingjie

Hi Bingjie,

Thanks for your adjustments. I changed both lines, and it turns out these changes work!

After 1100 training iterations the robot seems to be able to successfully complete a peg insertion (please find the corresponding webm attached). However, the insertion strategy does not seem that reliable or robust when trying other starting positions.

Further training seems to lead to a divergence of the strategy, which is no longer able to insert the peg after the full 8192 iterations (please find the corresponding webm attached). There was already a strong degradation starting at 1200 iterations.

When you said

Training time: we expect the policy to achieve a success rate similar to the one reported in the paper within 8-10 hours for the peg insertion policy. However, we did not terminate the policy training within this timeframe.

Did you mean that you ran the whole 8192 iterations within those 10 hours, or did you terminate at a comparable ~1000 iterations?

Best regards, Konstantin

IndustReal_PiH_1100.webm IndustReal_PiH_8200.webm

bingjietang718 commented 1 year ago

Hi Konstantin,

Have you tried the changes suggested in this answer?

Normally I will stop the policy training when I see the logged success rate reach ~85% at the most difficult level of the curriculum (i.e. logged curr_max_disp=-0.01). You can check the logged success rate and current curriculum difficulty level on WandB after setting up your own account and running the training code with wandb_activate=True.
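For example (wandb_entity and wandb_project are placeholders for your own account and project; the options come from the top-level IsaacGymEnvs config):

python train.py task=IndustRealTaskPegsInsert headless=True wandb_activate=True wandb_entity=<your-entity> wandb_project=<your-project>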

Best, Bingjie

wredsen commented 12 months ago

Hi Bingjie,

Have you tried the changes suggested in this answer?

I have, thanks for sharing this! With these changes the policy for the peg-hole task already converges to some reliable joining behaviour after just 200 episodes. After that it diverges, before assembling relatively reliably again at around 5000 episodes of training. Do you have an explanation for this very early convergence and the divergence afterwards? Would it make sense to lower the tolerance between peg and hole, given this low number of episodes to converge?

Normally I will stop the policy training when I see the logged success rate reach ~85% at the most difficult level of the curriculum (i.e. logged curr_max_disp=-0.01). You can check the logged success rate and current curriculum difficulty level on WandB after setting up your own account and running the training code with wandb_activate=True.

Also thanks for this note. But I don't understand the connection between the value of curr_max_disp and the success rate. Could you please elaborate on this?

Best regards, Konstantin

bingjietang718 commented 12 months ago

Hi Konstantin,

For the divergence, does it happen with multiple random seeds?

Decreasing the tolerance between peg and hole will definitely increase the difficulty of the task, which may introduce extra challenges during policy learning. Without careful evaluation, we cannot predict whether it would simply be solved by longer training time.

curr_max_disp corresponds to the $z^{low}$ in the sampling-based curriculum (you can find details about this in our paper). It indicates the current difficulty level of the curriculum. Essentially, when the policy success rate reaches a certain threshold, the curriculum difficulty increases, and when the policy success rate drops below a threshold, the curriculum difficulty decreases. We want to ensure we end the training only when it has already reached the highest level of difficulty.
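Schematically, the update rule behaves like the sketch below (the thresholds, step size, and bounds are illustrative only, not the exact values used in the code):

    def update_curriculum(curr_max_disp, success_rate,
                          upper_thresh=0.8, lower_thresh=0.5,
                          disp_step=0.002, hardest_disp=-0.01, easiest_disp=0.01):
        """Sampling-based curriculum sketch: make the task harder (lower the
        initial-engagement bound) when the policy succeeds often, and easier
        when it struggles. All numeric values here are illustrative."""
        if success_rate >= upper_thresh:
            curr_max_disp = max(curr_max_disp - disp_step, hardest_disp)  # harder
        elif success_rate <= lower_thresh:
            curr_max_disp = min(curr_max_disp + disp_step, easiest_disp)  # easier
        return curr_max_disp

    # Called once per evaluation window, e.g.:
    # curr_max_disp = update_curriculum(curr_max_disp, recent_success_rate)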

Best, Bingjie

wredsen commented 12 months ago

Hi Bingjie,

For the divergence, does it happen with multiple random seeds?

I just started another run. Is there anything I need to do to use a different random seed, or will one be chosen by default for every new training run? Another training run that I logged with wandb showed the same behaviour: Screenshot from 2023-10-11 10-46-16

Thanks for the explanation of curr_max_disp. After just a bit more than 1k steps it is already at the hardest configuration of the curriculum and has a peak success rate of nearly 97%, if I understand these graphs correctly. After that, especially in the range of 15k to 20k steps, the strategy quality seems to do what I call "diverge".

Could you provide a wandb plot from one of your successful training runs?

Also here is a graph of the gear insertion task: Screenshot from 2023-10-11 10-56-09

It does not even reach the next difficulty level of the curriculum after running for more than 2 days. It should use the same replacement that the peg-hole insertion uses, which you mentioned earlier:

self.ctrl_target_fingertip_centered_pos with self.ctrl_target_fingertip_midpoint_pos in this line, and replace self.ctrl_target_fingertip_centered_quat with self.ctrl_target_fingertip_midpoint_quat in this line ?

Best regards, Konstantin

bingjietang718 commented 12 months ago

Hi Konstantin,

The issue you are running into could originate from the training instability of PPO. This paper offers an in-depth analysis of best practices for training with PPO. Early stopping is an easy and effective approach in our case.
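Since the run directory already keeps periodic checkpoints under runs/<run_name>/nn/, one practical form of early stopping is simply to pick the checkpoint logged at the hardest curriculum level once the success rate reaches the target, or the best one seen there; a self-contained sketch with illustrative record fields and numbers:

    from typing import NamedTuple, Optional

    class IterRecord(NamedTuple):
        iteration: int
        success_rate: float
        curr_max_disp: float
        checkpoint: str

    def select_early_stop(records, hardest_disp=-0.01, target=0.85) -> Optional[IterRecord]:
        """Return the first record at the hardest curriculum level whose success
        rate reaches the target, otherwise the best one seen at that level."""
        at_hardest = [r for r in records if r.curr_max_disp <= hardest_disp]
        if not at_hardest:
            return None
        for r in at_hardest:
            if r.success_rate >= target:
                return r
        return max(at_hardest, key=lambda r: r.success_rate)

    # Example with made-up numbers resembling the behaviour described above:
    history = [
        IterRecord(800, 0.60, 0.00, "ep_800.pth"),
        IterRecord(1100, 0.97, -0.01, "ep_1100.pth"),   # good policy
        IterRecord(8192, 0.10, -0.01, "ep_8192.pth"),   # diverged later
    ]
    print(select_early_stop(history))  # -> the 1100-iteration checkpoint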

Best, Bingjie

ismarou commented 11 months ago

Hi Isidoros,

Can you revert my previously suggested fix (i.e. change back to _fingertip_centered_) and add the following code blocks here? [...]

Bingjie

Hi Bingjie,

Thanks for the fix. It indeed allowed the insertion policy to train and, within 8-10 hours, reach performance similar to what you report in the paper for simulation.

Best, Isidoros

P.S. Of course, if I let it train for longer, I also see the same unstable PPO behaviour that Konstantin reports above, across multiple seeds.

wredsen commented 11 months ago

Hi Konstantin,

The issue you are running into could originate from the training instability of PPO. This paper offers an in-depth analysis of best practices for training with PPO. Early stopping is an easy and effective approach in our case.

Best, Bingjie

Hi Bingjie,

Thanks for pointing this out. I tried to train the gear insertion task with the additional functions generate_ctrl_signals, _set_dof_pos_target and _set_dof_torque that you proposed.

However, it does not converge at all: Screenshot from 2023-10-26 09-33-50

Is there anything else I could try?

Best, Konstantin

bingjietang718 commented 11 months ago

Hi Konstantin,

I am not exactly sure what is happening. One thing you can try is to increase the number of points sampled on the gear mesh via this parameter in IndustRealTaskGearsInsert.yaml. This will essentially sample more points on the gear mesh for the SDF reward query, and an increased number of points will better capture the geometry of the gear.
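For example, the value can be overridden with Hydra on the command line (the key name below is illustrative; use the exact name and section from IndustRealTaskGearsInsert.yaml):

python train.py task=IndustRealTaskGearsInsert headless=True task.rl.sdf_reward_num_samples=5000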

I have attached my training plots below (with the default value). Screenshot from 2023-11-07 10-01-41

Best, Bingjie