Mark-98 opened this issue 3 weeks ago
Hello,

First of all, I wanted to thank you for writing such an insightful paper; it has been incredibly helpful for my research.

I have a question regarding the input configuration in the RDT-1B model, where the action noise, the proprioception z, and the frequency c are concatenated as inputs for reconstruction.

Typically, in diffusion models, the conditioning data is provided separately (for example, through cross-attention) to guide the reconstruction process. In RDT-1B, however, proprio z and frequency c are not used as standalone conditions but are instead concatenated directly with the input.
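To make sure I'm reading it correctly, here is roughly how I picture the concatenation (a minimal sketch; the module names and dimensions are my own assumptions, not taken from your repo, and I assume the diffusion timestep t is embedded as a token in the same way):

```python
import torch
import torch.nn as nn

hidden = 1152                            # assumed DiT hidden size
t_embed      = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
proprio_proj = nn.Linear(128, hidden)    # proprio z    -> one token (dims assumed)
freq_proj    = nn.Linear(1, hidden)      # frequency c  -> one token
action_proj  = nn.Linear(128, hidden)    # noisy action chunk -> T tokens

def build_input(noisy_actions, t, z, c):
    # noisy_actions: (B, T, 128), t: (B, 1), z: (B, 128), c: (B, 1)
    cond = torch.stack(
        [t_embed(t), proprio_proj(z), freq_proj(c)], dim=1
    )                                     # (B, 3, hidden) condition tokens
    acts = action_proj(noisy_actions)     # (B, T, hidden)
    # The conditions become ordinary tokens in the sequence, so plain
    # self-attention lets every action token attend to them and no
    # separate cross-attention branch is needed.
    return torch.cat([cond, acts], dim=1) # (B, 3 + T, hidden)
```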
Could you explain why this particular input configuration was chosen, and whether it provides particular advantages during the diffusion process?

Thank you!
We adopted in-context conditioning for the low-dimensional inputs because they are very short, and a cross-attention mechanism may be too heavy for them. We masked the gradients corresponding to these three conditions.
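Roughly, the masking idea can be sketched like this (an illustration of the general technique, not the repo's exact code; the token layout, ordering, and names are assumptions):

```python
import torch.nn.functional as F

NUM_COND = 3  # timestep t, frequency c, proprio z (assumed ordering)

def diffusion_loss(dit_output, noise_target):
    # dit_output:   (B, NUM_COND + T, action_dim) -- outputs at every position
    # noise_target: (B, T, action_dim)            -- the noise to be predicted
    pred = dit_output[:, NUM_COND:, :]            # keep only action positions
    # Slicing off the condition positions is equivalent to multiplying the
    # per-token loss by a 0/1 mask that zeros the condition tokens, so no
    # gradient flows back through the predictions made at those positions.
    return F.mse_loss(pred, noise_target)
```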
Thank you for your answer.

I understand that you're using the in-context learning concept in the DiT (Diffusion Transformer). I also reviewed your code, but I'm still having trouble understanding the gradient-masking process.

Could you explain it a bit more? Specifically, could you clarify where exactly the gradient is being masked? Is it applied only within the DiT decoder, and if so, which parts are affected?