vimalabs / VIMA

Official Algorithm Implementation of ICML'23 Paper "VIMA: General Robot Manipulation with Multimodal Prompts"

the physical meaning of the parameter 'actions' #32

Closed · kaixin-bai closed this issue 1 year ago

kaixin-bai commented 1 year ago

In VIMA, the robot's final executed position is fed into env.step(actions) through the actions variable. The robot then calculates positions like prepick, postpick, preplace, and postplace. It appears that pose0_position and pose1_position in the actions dictionary represent the positions to which the robot will move in Cartesian coordinates, while pose0_rotation and pose1_rotation seem to be quaternions.

My question is how these variables are transformed into their final form. I noticed in the code above that you've defined some variables to scale these positional variables, such as _n_discrete_x_bins, _n_discrete_y_bins, etc.

Suppose I try the VIMA algorithm on a physical robot using my camera. How would my actions be transformed from the form below to the final grasping position? Also, the values below don't appear to be in pixel coordinates.

            """
            actions:
              'pose0_position': Tensor:(1,1,2) tensor([[[16, 35]]])
              'pose0_rotation': Tensor:(1,1,4) tensor([[[25, 25, 25, 49]]])
              'pose1_position': Tensor:(1,1,2) tensor([[[13, 85]]])
              'pose1_rotation': Tensor:(1,1,4) tensor([[[25, 25, 49, 19]]])
            """
            actions = {k: v.mode() for k, v in dist_dict.items()}
yunfanjiang commented 1 year ago

Thanks for your interest. To answer your questions:

My question is how these variables are transformed into their final form. I noticed in the code above that you've defined some variables to scale these positional variables, such as _n_discrete_x_bins, _n_discrete_y_bins, etc.

Our model predicts indices of discrete bins, which are subsequently converted to continuous values in the range [0, 1]. These are then mapped to coordinates in the simulated workspace.
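
To make that concrete, here is a minimal sketch of the de-discretization step, assuming hypothetical bin counts and workspace bounds; in the repo these come from the policy configuration (e.g. `_n_discrete_x_bins`, `_n_discrete_y_bins`) and the environment's action bounds, not from this sketch:

```python
# Minimal sketch: bin index -> value in [0, 1] -> workspace coordinate.
# The bin counts and bounds below are placeholders, not the repo's values.
N_X_BINS, N_Y_BINS = 50, 100
X_BOUNDS = (0.25, 0.75)   # assumed workspace x range in meters
Y_BOUNDS = (-0.50, 0.50)  # assumed workspace y range in meters


def debin_position(x_bin: int, y_bin: int) -> tuple[float, float]:
    """Convert discrete position bins to continuous workspace coordinates."""
    x01 = x_bin / N_X_BINS
    y01 = y_bin / N_Y_BINS
    x = X_BOUNDS[0] + x01 * (X_BOUNDS[1] - X_BOUNDS[0])
    y = Y_BOUNDS[0] + y01 * (Y_BOUNDS[1] - Y_BOUNDS[0])
    return x, y


# e.g. the predicted pose0_position bins [16, 35] from the snippet above
print(debin_position(16, 35))
```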

Suppose I try the VIMA algorithm on a physical robot using my camera. How would my actions be transformed from the form below to the final grasping position? Also, the values below don't appear to be in pixel coordinates.

Since the model outputs coordinates relative to the camera frame, we need to transform them into the world frame to obtain the grasping positions.
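
On a physical robot, that transform is the usual rigid-body change of frame using your camera's extrinsic calibration. The sketch below is generic and not code from this repository; `T_world_cam` and the example point are placeholders you would replace with your own calibration and the model's de-discretized output:

```python
import numpy as np


def camera_to_world(p_cam: np.ndarray, T_world_cam: np.ndarray) -> np.ndarray:
    """Transform a 3D point from the camera frame into the world frame.

    T_world_cam is the 4x4 pose of the camera expressed in the world frame
    (i.e. your camera's extrinsic calibration).
    """
    p_h = np.append(p_cam, 1.0)        # homogeneous coordinates
    return (T_world_cam @ p_h)[:3]


# Placeholder extrinsic; replace with your calibrated camera pose.
T_world_cam = np.eye(4)
print(camera_to_world(np.array([0.10, 0.20, 0.50]), T_world_cam))
```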