real-stanford / diffusion_policy

[RSS 2023] Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
https://diffusion-policy.cs.columbia.edu/
MIT License

Spatial softmax instead of global avg pooling? #9

Closed lorenwel closed 1 year ago

lorenwel commented 1 year ago

Hi, thank you for your beautiful code ❤️

In section III.B of your paper you mention that you replace the global average pooling in the ResNet with a spatial softmax. However, I cannot find where this is done in your code.

I can only see where you replace the batch norm with a group norm https://github.com/columbia-ai-robotics/diffusion_policy/blob/0d00e02b45e9e3f37f4eeb68bff076b68d9e9d44/diffusion_policy/model/vision/multi_image_obs_encoder.py#L62-L69

and where you remove the fully connected final layer https://github.com/columbia-ai-robotics/diffusion_policy/blob/0d00e02b45e9e3f37f4eeb68bff076b68d9e9d44/diffusion_policy/model/vision/model_getter.py#L15

but not where you change the average pooling.

Am I missing something or did you actually use average pooling, contrary to what's stated in the paper?
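For reference, a minimal sketch of the two modifications the question points to, applied to a torchvision `resnet18`. This is illustrative only, not the repo's actual code (which does the swap with its own submodule-replacement helper); `get_modified_resnet18` and `group_size` are hypothetical names:

```python
import torch.nn as nn
from torchvision.models import resnet18

def get_modified_resnet18(group_size: int = 16) -> nn.Module:
    model = resnet18()

    # 1. Remove the final fully connected layer, keeping the 512-d features.
    model.fc = nn.Identity()

    # 2. Recursively swap every BatchNorm2d for a GroupNorm over the same
    #    channel count (resnet18 channels 64/128/256/512 divide evenly by 16).
    def replace_bn(module: nn.Module) -> None:
        for name, child in module.named_children():
            if isinstance(child, nn.BatchNorm2d):
                setattr(module, name,
                        nn.GroupNorm(child.num_features // group_size,
                                     child.num_features))
            else:
                replace_bn(child)

    replace_bn(model)
    return model
```

Note that neither step touches the average pooling, which is what the question is about.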

cheng-chi commented 1 year ago

Hi @lorenwel, good question! All end-to-end diffusion policy experiments are actually trained with hybrid_image_policy, which directly loads robomimic's vision encoder instead of using the image_obs_encoder you saw. Their ResNet18 encoder implementation has SpatialSoftmax pooling enabled: https://github.com/ARISE-Initiative/robomimic/blob/b5d2aa9902825c6c652e3b08b19446d199b49590/robomimic/models/base_nets.py#L705.
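For readers unfamiliar with the technique: a spatial softmax treats each channel's feature map as a probability distribution over pixel locations and returns the expected (x, y) coordinate per channel, yielding soft keypoints of shape (B, 2·C) instead of the (B, C) vector a global average pool would produce. A minimal PyTorch sketch of the idea, not robomimic's implementation linked above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Per-channel softmax over spatial locations, returning the expected
    (x, y) pixel coordinate of each channel's activation map."""

    def __init__(self, height: int, width: int, temperature: float = 1.0):
        super().__init__()
        self.temperature = temperature
        # Fixed coordinate grids in [-1, 1], stored as non-trainable buffers.
        ys, xs = torch.meshgrid(
            torch.linspace(-1.0, 1.0, height),
            torch.linspace(-1.0, 1.0, width),
            indexing="ij",
        )
        self.register_buffer("pos_x", xs.reshape(1, 1, -1))
        self.register_buffer("pos_y", ys.reshape(1, 1, -1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:  # (B, C, H, W)
        b, c, h, w = feat.shape
        # Softmax over the H*W locations of each channel.
        attn = F.softmax(feat.reshape(b, c, h * w) / self.temperature, dim=-1)
        ex = (attn * self.pos_x).sum(dim=-1)  # expected x per channel, (B, C)
        ey = (attn * self.pos_y).sum(dim=-1)  # expected y per channel, (B, C)
        return torch.cat([ex, ey], dim=-1)    # (B, 2*C) keypoint features
```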

Our implementation of the vision encoder is only used for the pretrained vision backbone (r3m, imagenet) experiments.

lorenwel commented 1 year ago

Got it! Thanks for the swift response.