thu-ml / RoboticsDiffusionTransformer

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
MIT License

Fine-Tuning in Relative Action Space #14

Open lakomchik opened 1 week ago

lakomchik commented 1 week ago

I would like to fine-tune RDT in a relative action space and have a question about the best way to map actions and proprioception.

Question: when fine-tuning the model in a relative action space, using relative actions results in a much smaller range for the proprioception and action values. Would it be preferable to normalize these values so that they align better with the model's expected action space?

csuastt commented 1 week ago

Map them into the velocity slots of the unified action space (e.g., delta EEF positions should go into the EEF position-velocity slots). You could also apply normalization when fine-tuning.
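
A minimal sketch of what this mapping could look like, assuming key names such as `eef_vel_x` and `eef_angular_vel_roll` (check `configs/state_vec.py` for the exact entries of `STATE_VEC_IDX_MAPPING`); the normalization bound is purely illustrative:

```python
# Rough sketch, not the repo's actual preprocessing code: place delta EEF
# actions into the velocity slots of the 128-dim unified action vector.
# Key names below are assumptions -- verify them against configs/state_vec.py.
import numpy as np
from configs.state_vec import STATE_VEC_IDX_MAPPING

UNI_VEC_DIM = 128  # assumed size of the unified state/action vector

def fill_relative_action(delta_pos, delta_rpy, delta_bound=0.05):
    """delta_pos: (3,) delta EEF position; delta_rpy: (3,) delta roll/pitch/yaw."""
    vec = np.zeros(UNI_VEC_DIM, dtype=np.float32)
    # Delta positions go into the EEF *velocity* slots,
    # optionally normalized to roughly [-1, 1].
    for val, key in zip(np.asarray(delta_pos) / delta_bound,
                        ["eef_vel_x", "eef_vel_y", "eef_vel_z"]):
        vec[STATE_VEC_IDX_MAPPING[key]] = val
    # Delta RPY goes into the EEF *angular velocity* slots.
    for val, key in zip(delta_rpy,
                        ["eef_angular_vel_roll", "eef_angular_vel_pitch",
                         "eef_angular_vel_yaw"]):
        vec[STATE_VEC_IDX_MAPPING[key]] = val
    return vec
```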

lakomchik commented 6 days ago

@csuastt Thank you for your answer!

budzianowski commented 3 days ago

@csuastt - all the preprocessing scripts use eef_delta_pos_x (for example: https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/414715e34e6734ef4563f806114ff37752e0eb58/data/preprocess_scripts/bridgev2.py#L128), and so do all the other preprocess_scripts for OXE. I can't find the place where these slots are converted to eef_vel_x?

alik-git commented 3 days ago

@csuastt Follow-up question: in the example above, the model predicts eef_ang_x, y, z, w for a quaternion, but I don't see a way to map these directly to velocities, because in STATE_VEC_IDX_MAPPING the angular velocities are roll, pitch, yaw only; see here:

https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/414715e34e6734ef4563f806114ff37752e0eb58/configs/state_vec.py#L62

Could you please clarify which indices you use and how exactly you map the quaternion eef_ang_x, y, z, w into STATE_VEC_IDX_MAPPING? That would be greatly appreciated. Thank you!

csuastt commented 3 days ago

@alik-git @budzianowski Sorry, this is our mistake :( In the current implementation, we do not use any actions in TFDataset; we use future states instead. To use actions, you need to make some modifications:

  1. In this line, remove the function that converts RPY to a quaternion; it was a mistake that we forgot to delete:

https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/414715e34e6734ef4563f806114ff37752e0eb58/data/preprocess_scripts/bridgev2.py#L115

The original action is already a delta RPY, which serves as the angular velocity.

  2. You may need to modify the follow-up preprocessing script so that the producer generates the actions instead of the future states (see the sketch after this list). See this readme:

https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md?plain=1#L242
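
For point 2, a rough sketch of switching the producer's prediction target from future states to actions; the episode dict, array names, and chunk length below are illustrative assumptions, not the repo's actual producer code:

```python
import numpy as np

CHUNK_SIZE = 64  # illustrative action-chunk length

def get_target_chunk(episode, step_id, use_actions=True):
    """Return the chunk the producer should emit as the prediction target."""
    if use_actions:
        # modified behaviour: emit the (already unified) action vectors
        chunk = episode["action"][step_id : step_id + CHUNK_SIZE]
    else:
        # original behaviour: emit future states as the target
        chunk = episode["state"][step_id + 1 : step_id + 1 + CHUNK_SIZE]
    # pad the tail of the episode by repeating the last entry
    if len(chunk) < CHUNK_SIZE:
        pad = np.repeat(chunk[-1:], CHUNK_SIZE - len(chunk), axis=0)
        chunk = np.concatenate([chunk, pad], axis=0)
    return chunk
```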

budzianowski commented 2 days ago

Thanks for the prompt reply, this is very helpful! One more question: does the model used in the demos from the paper also follow this logic, or was the fine-tuning performed with the modified logic?

ethan-iai commented 2 days ago

To clarify, in the demos mentioned in the paper we predict actions rather than future states. It depends on your robot: on our robot (ALOHA), the future states and the actions are different. Please let me know if you'd like further details!

budzianowski commented 1 day ago

@ethan-iai Thanks for the helpful explanation! If that's the case, I'm still puzzled by the Agilex fine-tuning setup, where the actions are used?

alik-git commented 1 day ago

@ethan-iai @csuastt I just want to clarify: when you say "predict the actions", do you mean that the neural network directly outputs action deltas as the logits? Or does the model directly output future states, which you then use to manually compute the action deltas (future_state - current_state = action_deltas)?

The reason I ask is that during pretraining the model predicts future states as the logits (please correct me if that's wrong), so why not keep that consistent during fine-tuning as well?

Just for context: we are trying to evaluate RDT on controlling a WidowX robot arm. We are wondering whether, during fine-tuning, it would be better to use the future states as the ground-truth labels and compute the action deltas manually during deployment, or to fine-tune with the action deltas directly as the ground-truth labels. My naive assumption was that fine-tuning on action deltas directly would be worse, since the model has to relearn more (due to differences in scale and representation, e.g., the smaller ranges of action deltas compared to joint positions).
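
For concreteness, here is a rough sketch of the first option at deployment time (recovering deltas from a predicted chunk of future states); the function name is purely illustrative and `predicted_states` would come from the model:

```python
import numpy as np

def states_to_deltas(current_state: np.ndarray, predicted_states: np.ndarray) -> np.ndarray:
    """Convert a predicted chunk of absolute future states into per-step action deltas."""
    states = np.concatenate([current_state[None], predicted_states], axis=0)
    # action_deltas[t] = state[t + 1] - state[t]
    return np.diff(states, axis=0)
```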

But if you fine-tuned directly on action deltas (and empirically found it to be better), then we should reconsider our approach of fine-tuning on future states. Sorry for the long question; I just wanted to be extra clear about the source of the confusion. Thank you for your time in answering all these questions, we greatly appreciate it!