Hi, I am currently looking at implementing a diffusion model for policy learning and was very impressed by your work! I was wondering which components of your approach you found to be particularly important for good results. Three things I was specifically curious about:
I see you use EMA, did you find that the model predictions were particularly unimodal/overfit to recent training data without it?
Was the causal attention masking used in the transformer variant crucial in getting this architecture to work, or do you think simply decoding waypoints from a more BERT-style encoder architecture would work?
In the appendix it seems you used a particularly large model for the CNN variant and say that you always found larger CNN -> better performance. Was the performance of much smaller CNNs (e.g. ~10M) much worse?
I empirically found EMA to accelerate training (eval performance increases faster) and to improve final performance (by <5%), but the policy should "work" even without it.
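For reference, the EMA in question is just an exponentially weighted copy of the model weights that is updated after each optimizer step and used at eval time. A minimal sketch, assuming parameters are stored as a plain dict of floats (real implementations, e.g. in PyTorch, operate on `state_dict` tensors; `ema_update` and `decay` here are illustrative names, not the repo's API):

```python
# Hypothetical minimal EMA of model parameters.
def ema_update(ema_params, model_params, decay=0.999):
    """Blend the current model weights into the EMA copy, in place."""
    for name, value in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * value
    return ema_params

# Usage: keep a separate EMA copy, update it after every training step,
# and run evaluation with the EMA weights rather than the raw ones.
ema = {"w": 0.0}
for step_weight in [1.0, 1.0, 1.0]:
    ema_update(ema, {"w": step_weight}, decay=0.5)
# After three updates toward 1.0 with decay=0.5, ema["w"] == 0.875
```

The high decay (typically 0.999+) is what smooths out the noise from recent minibatches, which is consistent with the faster, more stable eval curves described above.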
I found the causal attention masking to be critical to get the transformer variant of diffusion policy to work. My suspicion is that when used without it, the model "cheats" by looking ahead into future end-effector poses, which is almost identical to the action of the current timestep.
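To make the masking concrete, a causal mask only lets position i attend to positions j <= i, so the decoder cannot peek at future end-effector poses; a BERT-style bidirectional encoder would effectively use an all-True mask. A toy sketch (pure Python for illustration; in practice this would be a boolean or additive mask tensor passed to the attention layer):

```python
# Hypothetical illustration of a causal attention mask.
# True = attention allowed, False = masked out.
def causal_mask(seq_len):
    """Return a seq_len x seq_len lower-triangular boolean mask."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

mask = causal_mask(4)
# Row 0 can only see position 0; row 3 sees positions 0..3.
# Without this mask (all entries True), position t could attend to the
# pose at t+1, which is nearly identical to the action at t.
```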
I think the model capacity needed depends on task complexity (a more complex task requires a larger CNN). Reducing the number of training diffusion steps also reduces the CNN capacity requirement, at the expense of reduced action quality. A ~10M-parameter CNN should still work, with less than a 10% performance penalty on the benchmarks we have tested.