thu-ml / RoboticsDiffusionTransformer

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
MIT License

[Question about Input Configuration in RDT-1b] Why Concatenate Action Noise, Proprio Z, and Frequency C? #9

Open Mark-98 opened 3 weeks ago

Mark-98 commented 3 weeks ago

Hello,

First of all, I wanted to thank you for writing such an insightful paper—it's been incredibly helpful for my research.

I have a question regarding the input configuration in the RDT-1b model, where action noise, proprio z, and frequency c are concatenated as inputs for reconstruction.

Typically, in diffusion models, the conditioning data is provided separately to guide the reconstruction process. However, in RDT-1b, proprio z and frequency c are not used as standalone conditions but are instead concatenated directly with the input.

Could you explain why this specific input configuration was chosen and whether it provides particular advantages during the diffusion process?

Thank you!

csuastt commented 3 weeks ago

We adopted in-context conditioning for the low-dimensional inputs because they are very short, and cross-attention may be too heavy for them. We masked the gradients corresponding to these three conditions.
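For readers following along, here is a minimal sketch of one way to read the reply above: the condition tokens are joined with the noisy action tokens into a single sequence, and the loss is computed only on the action positions, so the loss contributes no direct gradient at the condition positions. Everything here is an assumption for illustration, not code from the RDT repository: the function names `build_sequence` and `masked_mse`, the toy token width, the choice of two condition tokens, and the identity stand-in for the transformer are all hypothetical.

```python
# Hypothetical sketch: names, shapes, and the masking scheme are assumptions,
# not code from the RDT repository.

def build_sequence(cond_tokens, noisy_action_tokens):
    """In-context conditioning: join condition and action tokens into one sequence."""
    return cond_tokens + noisy_action_tokens


def masked_mse(outputs, targets, num_cond):
    """MSE over action positions only, plus its gradient w.r.t. every output token."""
    action_outputs = outputs[num_cond:]  # drop outputs at condition positions
    count = sum(len(tok) for tok in action_outputs)
    loss = 0.0
    grad = [[0.0] * len(tok) for tok in outputs]  # condition rows stay zero
    for i, (out_tok, tgt_tok) in enumerate(zip(action_outputs, targets)):
        for j, (o, t) in enumerate(zip(out_tok, tgt_tok)):
            loss += (o - t) ** 2 / count
            grad[num_cond + i][j] = 2.0 * (o - t) / count
    return loss, grad


# Toy data: 2 condition tokens (standing in for proprio z and frequency c)
# followed by 3 noisy action tokens, each of width 2.
cond = [[0.5, 0.5], [1.0, 0.0]]
noisy_actions = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
sequence = build_sequence(cond, noisy_actions)

# Stand-in for the transformer: an identity map over the joined sequence.
outputs = [list(tok) for tok in sequence]
targets = [[0.0, 0.0] for _ in noisy_actions]

loss, grad = masked_mse(outputs, targets, num_cond=len(cond))
print(len(sequence), grad[: len(cond)])
```

In this toy setup the gradient rows for the two condition positions stay at zero, while the action rows receive the usual MSE gradient. Note that in a real transformer the condition tokens would still influence the action outputs through self-attention; the sketch only shows that the loss has no direct term at the condition positions.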

Mark-98 commented 2 weeks ago

Thank you for your answer.

I understand that you're using the in-context conditioning concept from DiT (Diffusion Transformer).

I also reviewed your code, but I'm still having trouble understanding the gradient masking process.

Could you explain it a bit more? Specifically, could you clarify where exactly the gradient is being masked? Is it applied only within the DiT decoder, and if so, which parts are affected?