Closed by Jmh0527 1 day ago
Thanks for your question!
In fact, we ablated the kernel size and also tried conv3d (see Table 13 in the supplementary material). We found that introducing spatial convolution decreases performance and increases memory usage. We suspect that the original backbone already handles the spatial context effectively, and that additional spatial processing could disrupt the pretrained knowledge.
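For a rough sense of the size difference between the two options, here is a back-of-the-envelope sketch that counts the weights of a depthwise temporal conv versus a depthwise 3D conv. The channel width of 384 and the kernel size of 3 are illustrative assumptions, not values taken from the repo or the paper, and the helper name `depthwise_conv_params` is hypothetical. It counts parameters only and does not model activation memory.

```python
from math import prod


def depthwise_conv_params(channels, kernel_size, bias=True):
    """Parameter count of a depthwise conv: one filter of shape
    `kernel_size` per channel, plus an optional per-channel bias."""
    weights = channels * prod(kernel_size)
    return weights + (channels if bias else 0)


C = 384  # assumed adapter width, for illustration only

temporal = depthwise_conv_params(C, (3,))             # conv1d over time
spatiotemporal = depthwise_conv_params(C, (3, 3, 3))  # conv3d over time and space

print("depthwise conv1d params:", temporal)
print("depthwise conv3d params:", spatiotemporal)
print("multiply-adds per output element:", prod((3,)), "vs", prod((3, 3, 3)))
```

With these assumed sizes, the 3D kernel has 9x the weights and 9x the multiply-adds per output element compared with the temporal kernel.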
Actually, I think that when using dwconv, a 1x1 conv should be applied before it to capture cross-channel information. I recall a paper that took this approach, although I can't remember which one. In the current design, the 1x1 conv is applied after the dwconv. What do you think?
To be honest, I guess the order of the 1x1 conv will have only a minor impact on performance. Still, you can easily ablate this in the experiments.
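To make the two orderings concrete, here is a minimal pure-Python sketch on a toy input. It is not the adapter code from the repo: the helpers `depthwise_conv1d` and `pointwise_conv`, the toy tensors, and the zero "same" padding are all assumptions for illustration, and it ignores anything else the real adapter does (projections, activations, residuals). The depthwise step uses the cross-correlation convention that deep learning frameworks use.

```python
def depthwise_conv1d(x, w):
    """Per-channel temporal conv. x: [C][T], w: [C][K] with odd K.
    Zero padding keeps the output length equal to T."""
    K = len(w[0])
    pad = K // 2
    T = len(x[0])
    out = []
    for xc, wc in zip(x, w):
        padded = [0.0] * pad + list(xc) + [0.0] * pad
        out.append([sum(wc[k] * padded[t + k] for k in range(K)) for t in range(T)])
    return out


def pointwise_conv(x, m):
    """1x1 conv: mixes channels at each time step. m: [C_out][C_in]."""
    T = len(x[0])
    return [[sum(row[c] * x[c][t] for c in range(len(x))) for t in range(T)]
            for row in m]


# Toy input: 2 channels, 4 time steps.
x = [[1.0, 2.0, 3.0, 4.0],
     [0.0, 1.0, 0.0, 1.0]]
# Channel 0 reads the previous frame, channel 1 reads the next frame.
w = [[1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
# Output channel 0 sums both input channels, output channel 1 copies channel 1.
m = [[1.0, 1.0],
     [0.0, 1.0]]

dw_then_pw = pointwise_conv(depthwise_conv1d(x, w), m)  # dwconv first, then 1x1
pw_then_dw = depthwise_conv1d(pointwise_conv(x, m), w)  # 1x1 first, then dwconv

print(dw_then_pw)
print(pw_then_dw)
```

Both orderings use the same number of weights (C*K for the depthwise filters plus C*C for the 1x1 mixing), so the ablation amounts to swapping the two calls. The outputs differ in general because the temporal filter is tied to the input channel in one order and to the output channel in the other.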
Thanks for your excellent work. I noticed that conv1d is used in the adapter of AdaTAD (opentad/models/backbone/vit_adapter). Why didn't you try conv3d?