Octo's action space consists of end-effector velocities, i.e. changes in ['x', 'y', 'z', 'yaw', 'pitch', 'roll', 'grasp']. I want to assess the model's zero-shot capability in a simulator. I understand there is a significant domain gap; my goal is mainly to verify that the pipeline runs without errors. I'm using RLBench, where I observed that the action space is defined by joint angles, which differs from Octo's.
My questions:
Regarding zero-shot: should I compute the corresponding joint angles from Octo's outputs using inverse kinematics, so that the model's actions match what the environment expects?
Regarding few-shot: should the actions in the demonstrations be end-effector velocities rather than joint angles? As I understand it, joint angles are directly observable, whereas end-effector velocities have to be computed.
Yes, inverse kinematics would be the way to go. Note, though, that we are not conditioning Octo on the action space definition, so zero-shot performance will likely be bad, since the model is not familiar with the action space definition in RLBench.
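To make the inverse-kinematics step concrete, here is a minimal sketch of turning a small end-effector delta into a joint-angle target with damped least-squares differential IK. It assumes a hypothetical two-link planar toy arm (position only, no orientation or gripper), not the actual RLBench robot; `LINK_LENGTHS`, `forward_kinematics`, `jacobian`, and `delta_to_joint_target` are illustrative names and not part of Octo or RLBench.

```python
import numpy as np

# Hypothetical toy arm: two links of unit length moving in a plane.
LINK_LENGTHS = (1.0, 1.0)


def forward_kinematics(q):
    """Return the (x, y) end-effector position for joint angles q."""
    l1, l2 = LINK_LENGTHS
    x = l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1])
    y = l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1])
    return np.array([x, y])


def jacobian(q):
    """Analytic Jacobian d(x, y) / d(q1, q2) of the toy arm."""
    l1, l2 = LINK_LENGTHS
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ])


def delta_to_joint_target(q, delta_xy, damping=1e-3):
    """Map an end-effector delta to a joint-angle target.

    Uses damped least squares: dq = J^T (J J^T + damping^2 I)^-1 dx.
    The result is only a first-order approximation, so it is meant
    for small per-step deltas.
    """
    q = np.asarray(q, dtype=float)
    delta_xy = np.asarray(delta_xy, dtype=float)
    J = jacobian(q)
    dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), delta_xy)
    return q + dq


q_now = np.array([0.3, 0.8])
q_next = delta_to_joint_target(q_now, [0.01, -0.02])
achieved = forward_kinematics(q_next) - forward_kinematics(q_now)
print("joint target:", q_next, "achieved delta:", achieved)
```

For a real arm the same idea applies with a 6-DoF Jacobian from the simulator or a kinematics library, and the gripper dimension is passed through separately.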
Either way works. You can finetune Octo with end-effector velocities, which is closer to the training data but requires you to compute the target velocities for your training data. Alternatively, you can reinitialize the action head during finetuning and then directly finetune to predict joint angle actions.
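For the first option, the target velocities can be derived from the recorded end-effector poses. The following is a rough sketch under several assumptions: each demonstration stores absolute poses as [x, y, z, yaw, pitch, roll] plus a separate gripper signal, the rotation deltas are small enough that differencing Euler angles (with wrap-around) is acceptable, and the gripper command at step t is taken from the state at step t+1. `wrap_angle` and `poses_to_delta_actions` are hypothetical helpers, and Octo's own data preprocessing may differ.

```python
import numpy as np


def wrap_angle(a):
    """Wrap angles (radians) into the interval [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi


def poses_to_delta_actions(poses, grasp):
    """Convert absolute end-effector poses into per-step delta actions.

    poses: array of shape (T, 6) with [x, y, z, yaw, pitch, roll].
    grasp: array of shape (T,) with the gripper state at each step.
    Returns an array of shape (T - 1, 7):
    [dx, dy, dz, dyaw, dpitch, droll, grasp].
    """
    poses = np.asarray(poses, dtype=float)
    grasp = np.asarray(grasp, dtype=float)
    delta = poses[1:] - poses[:-1]
    # Euler differences can jump by 2*pi at the wrap-around; undo that.
    delta[:, 3:6] = wrap_angle(delta[:, 3:6])
    # Assumption: the gripper target for step t is the state at t + 1.
    return np.concatenate([delta, grasp[1:, None]], axis=1)


demo_poses = [
    [0.0, 0.0, 0.5, 3.1, 0.0, 0.0],
    [0.1, 0.0, 0.5, -3.1, 0.0, 0.0],  # yaw crosses the +/- pi boundary
]
demo_grasp = [1.0, 0.0]
print(poses_to_delta_actions(demo_poses, demo_grasp))
```

For larger rotations between steps, composing rotations (for example with rotation matrices or quaternions) would be more accurate than differencing Euler angles.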
Thanks for your great work!