mlzxy / arp

Autoregressive Policy for Robot Learning

Loss settings when training with real Aloha robot dataset #11

Open a510721 opened 1 week ago

a510721 commented 1 week ago

Thank you for your kind responses, as always. If I want to train on a real ALOHA robot dataset, only the joint values, actions, and camera images are saved. In simulation, we compute action.nll_loss, guide-pt-left.2d_ce_loss, and guide-pt-right.2d_ce_loss. If I train using only action.nll_loss on the real robot dataset, will I get the same result? Can I exclude just guide-pt-left.2d_ce_loss and guide-pt-right.2d_ce_loss? I would also appreciate an explanation of the losses (action.nll_loss, guide-pt-left.2d_ce_loss, and guide-pt-right.2d_ce_loss).

When creating arp.TokenType.make for guide_token_right and guide_token_left, I set is_control=True so that they are excluded from the loss. I'm training with action.nll_loss, but the loss value is negative. Is that okay? What does the minus sign mean?

INFO 2024-11-20 11:13:52 train.py:195 step:0 smpl:8 ep:0 epch:0.00 action.nll_loss:1.480 grdn:52.652 lr:1.0e-05 updt_s:0.845 data_s:0.397
INFO 2024-11-20 11:14:04 train.py:195 step:200 smpl:2K ep:4 epch:0.08 action.nll_loss:0.288 grdn:9.624 lr:1.0e-05 updt_s:0.062 data_s:0.000
INFO 2024-11-20 11:14:16 train.py:195 step:400 smpl:3K ep:8 epch:0.16 action.nll_loss:-0.280 grdn:15.151 lr:1.0e-05 updt_s:0.059 data_s:0.000
INFO 2024-11-20 11:14:28 train.py:195 step:600 smpl:5K ep:12 epch:0.24 action.nll_loss:-0.412 grdn:15.015 lr:1.0e-05 updt_s:0.060 data_s:0.000

mlzxy commented 1 week ago

In simulation, we calculate action.nll_loss, guide-pt-left.2d_ce_loss, and guide-pt-right.2d_ce_loss. If I train using only action.nll_loss from the real robot dataset, will I get the same result?

If you don't use the guide-pt loss, the overall performance will be lower than ACT's. The guide points are 2d waypoints of the end-effectors. By predicting them autoregressively, the model learns better action prediction.

I uploaded an example in 2d-waypoints-real-robot.ipynb that shows how to compute the 2d positions of the robot joints (once you have a trajectory of actions, you can select and subsample the end-effector 2d points as 2d waypoints). Given robot joint values, the example shows how to (1) use forward kinematics to compute the 3d xyz of each robot joint, and (2) use the camera intrinsics + extrinsics to project those 3d points to 2d pixel coordinates.
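For reference, a minimal sketch of step (2); the function and variable names here (project_to_2d, K, T_world_to_cam) are my own placeholders for illustration, the notebook may organize this differently:

import numpy as np

def project_to_2d(points_3d, K, T_world_to_cam):
    """Project world-frame 3d joint positions (N, 3) to 2d pixel coordinates (N, 2)."""
    n = points_3d.shape[0]
    pts_h = np.concatenate([points_3d, np.ones((n, 1))], axis=1)  # (N, 4) homogeneous coordinates
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]                 # extrinsics: world frame -> camera frame
    pix = (K @ pts_cam.T).T                                       # intrinsics: camera frame -> image plane
    return pix[:, :2] / pix[:, 2:3]                               # divide by depth to get (x, y) pixels

You can then take the end-effector points along the trajectory and subsample them to form the 2d waypoints.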

We also have aloha/compute_waypoints.py, but that script obtains the 2d waypoints by directly communicating with the mujoco environment.


I would also appreciate an explanation of the losses (action.nll_loss, guide-pt-left.2d_ce_loss, and guide-pt-right.2d_ce_loss).

I'm training with action.nll_loss, but the loss value is negative. Is that okay? What does the minus sign mean?

The action.nll_loss is a negative log-likelihood loss (L781) that maximizes the likelihood of the continuous actions under a Gaussian-mixture distribution. So yes, it can be negative, which simply means the likelihood is very high.
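As a quick sanity check that a negative value is expected (a toy example using PyTorch's gaussian_nll_loss, not the exact mixture loss in arp): the NLL is the negative log of a probability density, and a density can exceed 1 when the variance is small, so the loss can drop below zero.

import torch
import torch.nn.functional as F

pred = torch.zeros(4)          # predicted action means
target = torch.zeros(4)        # ground-truth actions (perfectly predicted here)
var = torch.full((4,), 0.01)   # small predicted variance -> density at the target is > 1

loss = F.gaussian_nll_loss(pred, target, var)
print(loss)                    # about 0.5 * log(0.01) = -2.30, i.e. negative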

The guide ce losses are cross-entropy losses applied to the 2d pixel-coordinate actions (L982).
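To give an idea of the classification view (a simplified sketch, not the exact code at L982, and the resolution below is made up): the 2d location is discretized into bins over the image, and cross entropy is applied over those bins.

import torch
import torch.nn.functional as F

H, W = 96, 96                            # example resolution of the guide-point logits
logits = torch.randn(1, H * W)           # one score per candidate pixel location
gt_xy = torch.tensor([[40, 12]])         # ground-truth waypoint (x, y) on that grid
target = gt_xy[:, 1] * W + gt_xy[:, 0]   # flatten (x, y) into a single class index
loss = F.cross_entropy(logits, target)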

All of these losses are basically maximizing the likelihood of the given action sequences.


Can I exclude only guide-pt-left.2d_ce_loss and guide-pt-right.2d_ce_loss?
In creating arp.TokenType.make for guide_token_right and guide_token_left, I set is_control=True to prevent them from being calculated in the loss.

Don't set them to is_control=True. Tokens with is_control=True are meant for things like language prompts, i.e., information that is already present before robot execution. The guide 2d points still need to be predicted.

a510721 commented 6 days ago

So the action tokens are regressed with gaussian_nll_loss, while guide-pt-left and guide-pt-right treat the point's location in the image as a classification problem, and together these form the overall loss.

Should I still compute guide-pt-left and guide-pt-right when only the robot grippers are visible in cam_left_wrist and cam_right_wrist, and the rest of the robot is not in the image? https://mobile-aloha.github.io/

mlzxy commented 6 days ago

Hi @a510721. You can truncate them to be within the image. I drew an image to illustrate the idea: the red dots are the right-arm waypoints, the green dots are the left-arm waypoints, and the data is a screenshot from https://drive.google.com/drive/folders/17r6CBQKgXEt_0DfvBrSJkyZQMi66TAz9.

[image: left/right arm 2d waypoints overlaid on the camera view]

In this case, suppose we want to grasp the bottle: both arms need to move downwards, which means the trajectories go out of the image. You can clip them to stay within the image, so they still provide a sense of direction.

Note that, by predicting and visualizing the 2d waypoints during training, you get a good sense of whether the model "understands" how to solve the task.
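If it helps, a minimal sketch of such a visualization (using OpenCV here; the function is my own, not part of the repo):

import cv2
import numpy as np

def draw_waypoints(image, traj2d_right, traj2d_left):
    """Overlay predicted 2d waypoints on a camera frame: red = right arm, green = left arm."""
    vis = image.copy()
    for x, y in traj2d_right.astype(int):
        cv2.circle(vis, (int(x), int(y)), radius=3, color=(0, 0, 255), thickness=-1)  # red (BGR)
    for x, y in traj2d_left.astype(int):
        cv2.circle(vis, (int(x), int(y)), radius=3, color=(0, 255, 0), thickness=-1)  # green (BGR)
    return vis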

a510721 commented 6 days ago

Is it correct that your idea is to use the robot arm trajectory only in the visible part of the image? If I'm misunderstanding, please explain in detail.

mlzxy commented 6 days ago

Is it correct that your idea is to use the robot arm trajectory only in the visible part of the image? If I'm misunderstanding, please explain in detail.

Not exactly; I suggest clipping your 2d waypoints to lie within the image, like below:

import numpy as np

# traj2d: array of shape (N, 2), where N is the number of waypoints, each row in (x, y) format
# suppose your image size is h x w
traj2d[:, 0] = np.clip(traj2d[:, 0], 0, w - 1)  # clip x coordinates to [0, w-1]
traj2d[:, 1] = np.clip(traj2d[:, 1], 0, h - 1)  # clip y coordinates to [0, h-1]