lakomchik opened this issue 1 week ago
Map them into the velocity slots of the unified action space (e.g., delta eef positions should go into the eef position velocity slots). You could do normalization when fine-tuning.
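For example, here is a minimal sketch of that mapping, assuming STATE_VEC_IDX_MAPPING lives in configs/state_vec.py and contains keys like eef_vel_x/y/z (this is not the repo's official preprocessing code, just an illustration):

```python
# Minimal sketch (not the repo's preprocessing code): write delta EEF positions
# into the velocity slots of the unified vector. The 128-dim size and the
# eef_vel_* key names are assumptions based on the discussion above.
import numpy as np
from configs.state_vec import STATE_VEC_IDX_MAPPING

UNI_DIM = 128  # assumed dimensionality of the unified state/action space

def fill_delta_pos_as_velocity(delta_eef_pos):
    """delta_eef_pos: (3,) array of delta x/y/z from a relative-action dataset."""
    vec = np.zeros(UNI_DIM, dtype=np.float32)
    mask = np.zeros(UNI_DIM, dtype=np.float32)
    for value, key in zip(delta_eef_pos, ("eef_vel_x", "eef_vel_y", "eef_vel_z")):
        idx = STATE_VEC_IDX_MAPPING[key]
        vec[idx] = value
        mask[idx] = 1.0  # mark the slot as valid so it is not masked out
    return vec, mask
```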
@csuastt Thank you for your answer!
@csuastt All the preprocessing scripts are using eef_delta_pos_x (for example: https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/414715e34e6734ef4563f806114ff37752e0eb58/data/preprocess_scripts/bridgev2.py#L128, and the same holds for all the other preprocess_scripts for OXE). I can't find the place where these slots are mapped to eef_vel_x. Could you point me to it?
@csuastt Follow-up question: In the example above, you can see that the model predicts eef_ang_x, y, z, w for a quaternion, but I don't see a way to map these directly to velocities, because in STATE_VEC_IDX_MAPPING all the angular velocities are roll, pitch, and yaw only (see here). Could you please clarify which indices you use and how exactly you map the quaternion eef_ang_x, y, z, w into STATE_VEC_IDX_MAPPING? That would be greatly appreciated. Thank you!
@alik-git @budzianowski Sorry, it is our mistake :( In the current implementation, we do not use any action in TFDataset; we use future states instead. To use actions, you may need to make some modifications:
The original action is already delta RPY, which is the angular velocity.
https://github.com/thu-ml/RoboticsDiffusionTransformer/blob/main/docs/pretrain.md?plain=1#L242
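If your dataset instead stores the delta rotation as a quaternion (eef_ang_x, y, z, w), one option is to convert it to delta RPY before filling the angular-velocity slots. A hedged sketch (this is not the repo's code, and the angular-velocity key names are assumptions about STATE_VEC_IDX_MAPPING):

```python
# Sketch: convert a delta quaternion to delta roll/pitch/yaw and write it into
# the (assumed) angular-velocity slots of the unified vector.
import numpy as np
from scipy.spatial.transform import Rotation as R
from configs.state_vec import STATE_VEC_IDX_MAPPING

def fill_delta_quat_as_angular_velocity(delta_quat_xyzw, vec, mask):
    """delta_quat_xyzw: (4,) delta rotation in (x, y, z, w) order."""
    delta_rpy = R.from_quat(delta_quat_xyzw).as_euler("xyz")  # roll, pitch, yaw in radians
    keys = ("eef_angular_vel_roll", "eef_angular_vel_pitch", "eef_angular_vel_yaw")
    for value, key in zip(delta_rpy, keys):
        idx = STATE_VEC_IDX_MAPPING[key]
        vec[idx] = value
        mask[idx] = 1.0
    return vec, mask
```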
Thanks for the prompt reply, this is very helpful! One more question: does the model used in the demos from the paper also follow this logic, or was the fine-tuning performed with the modified logic?
To clarify, in the demos mentioned in the paper, we predict the actions rather than future states. It depends on your robot; on our robot (ALOHA), the future states and actions are different. Please let me know if you’d like further details!
@ethan-iai Thanks for the helpful explanation! If that's the case, I'm still puzzled by the agilex fine-tuning setup, where the actions are used.
@ethan-iai @csuastt I just want to clarify: when you say "predict the actions", do you mean that the neural network model directly outputs action deltas as the logits? Or do you mean that the model directly outputs future states, and then you manually compute the action deltas (future_state - current_state = action_deltas)?
The reason I ask is that during pretraining the model predicts future states as the logits (please correct me if that's wrong), so why not keep that consistent during fine-tuning as well?
Just for context: we are trying to evaluate RDT on controlling a WidowX robot arm. We are wondering whether, during our fine-tuning, it would be better to have the ground-truth labels be the future states and then compute the action deltas manually during deployment, or to fine-tune with action deltas directly as the ground-truth labels. My naive assumption was that fine-tuning with action deltas directly would be worse, since the model has to relearn more (due to differences in scale and representation, e.g., smaller ranges for action deltas compared to joint positions).
But if you fine-tuned directly on action deltas (and empirically found it to be better), then we should reconsider our approach of fine-tuning on future states. Sorry for the long question; I just wanted to be extra clear about where the confusion lies. Thank you for your time in answering all these questions, we greatly appreciate it!
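To make the two options concrete, here is an illustrative sketch of what we mean (the function and variable names are hypothetical, not from the repo):

```python
# Illustrative sketch of the two fine-tuning choices discussed above;
# names are hypothetical and not taken from the RDT codebase.
import numpy as np

def make_labels(current_state, future_states, use_action_deltas):
    """current_state: (D,) proprioception now; future_states: (H, D) next H states."""
    if use_action_deltas:
        # Option B: supervise directly on relative actions (deltas).
        return future_states - current_state
    # Option A: keep the pretraining convention and supervise on future states.
    return future_states

def actions_at_deployment(current_state, model_output, trained_on_deltas):
    """Recover executable action deltas from the model output at test time."""
    if trained_on_deltas:
        return model_output                 # already deltas
    return model_output - current_state     # convert predicted future states to deltas
```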
I would like to fine-tune RDT in a relative action space and have a question regarding the best method for mapping actions and proprioception.
Question: For fine-tuning a model in relative action space, would it be preferable to:
Using relative actions results in a smaller range for proprioception and action values. I’m curious if normalizing these values could help them better align with the model’s expected action space.
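In case it helps the discussion, a generic (not RDT-specific) way to do that normalization is per-dimension statistics computed over the fine-tuning dataset, for example:

```python
# Generic per-dimension z-score normalization sketch for relative actions;
# not taken from the RDT codebase.
import numpy as np

def compute_stats(all_actions):
    """all_actions: (N, D) array of relative actions from the fine-tuning dataset."""
    mean = all_actions.mean(axis=0)
    std = all_actions.std(axis=0) + 1e-8  # avoid division by zero on constant dims
    return mean, std

def normalize(action, mean, std):
    return (action - mean) / std

def denormalize(action_norm, mean, std):
    # Apply before sending commands to the robot at deployment time.
    return action_norm * std + mean
```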