Closed lorenwel closed 1 year ago
Hi @lorenwel, good question! All end-to-end diffusion policy experiments are actually trained with hybrid_image_policy, which directly loads robomimic's vision encoder instead of using the image_obs_encoder you saw. Their ResNet18 encoder implementation has SpatialSoftmax pooling enabled: https://github.com/ARISE-Initiative/robomimic/blob/b5d2aa9902825c6c652e3b08b19446d199b49590/robomimic/models/base_nets.py#L705.
Our vision encoder implementation is only used for the pretrained vision backbone (r3m, imagenet) experiments.
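To illustrate what SpatialSoftmax pooling does (a minimal NumPy sketch of the idea only; robomimic's actual implementation is a torch module with a learnable temperature and an optional keypoint-variance output):

```python
import numpy as np

def spatial_softmax(features):
    """features: (C, H, W) feature map -> (C, 2) expected (x, y) keypoints.

    A softmax over the H*W spatial positions of each channel yields a
    probability map; the output is the probability-weighted mean of a
    normalized [-1, 1] coordinate grid, i.e. one 2-D keypoint per channel.
    This replaces global average pooling's single scalar per channel.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)          # (C, H*W)
    xs, ys = np.meshgrid(np.linspace(-1, 1, w), np.linspace(-1, 1, h))
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1)  # (H*W, 2)
    return probs @ grid                                # (C, 2)
```

A channel whose activation is a single sharp peak maps to that peak's normalized image coordinates, which is why this pooling preserves spatial information that average pooling discards.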
Got it! Thanks for the swift response.
Hi, thank you for your beautiful code :heart:
In section III.B of your paper you mention that you replace the global average pooling in the ResNet with a spatial softmax. However, I cannot find where this is done in your code.
I can only see where you replace the batch norm with a group norm https://github.com/columbia-ai-robotics/diffusion_policy/blob/0d00e02b45e9e3f37f4eeb68bff076b68d9e9d44/diffusion_policy/model/vision/multi_image_obs_encoder.py#L62-L69
and where you remove the fully connected final layer https://github.com/columbia-ai-robotics/diffusion_policy/blob/0d00e02b45e9e3f37f4eeb68bff076b68d9e9d44/diffusion_policy/model/vision/model_getter.py#L15
but not where you replace the average pooling.
Am I missing something or did you actually use average pooling, contrary to what's stated in the paper?