conv1d in vit_adapter - Githubissues

sming256 / OpenTAD

OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.

Apache License 2.0

102 stars, 5 forks

conv1d in vit_adapter #23

Closed by Jmh0527 1 day ago

Jmh0527 commented 4 days ago

Thanks for your excellent work. I noticed that the adapter in AdaTAD (opentad/models/backbone/vit_adapter) uses conv1d. Why didn't you try conv3d?

sming256 commented 4 days ago

Thanks for your question!

In fact, we ablated the kernel size and also tried conv3d (see Table 13 in the supplementary material). We found that introducing spatial convolution decreases performance and increases memory usage. We suspect that the pretrained backbone already handles spatial context effectively, so additional spatial processing could disrupt the pretrained knowledge.
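For readers comparing the two options, here is a minimal PyTorch sketch of a temporal-only depthwise conv1d next to a spatio-temporal depthwise conv3d. It is not the OpenTAD implementation: the tensor sizes, the kernel size of 3, and the variable names are assumptions chosen for illustration. It only contrasts weight counts and shapes, and does not reproduce the accuracy or memory numbers from the ablation.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (not taken from OpenTAD).
channels, frames, height, width = 8, 16, 4, 4
k = 3

# Temporal-only depthwise conv: one length-k filter per channel.
temporal = nn.Conv1d(channels, channels, kernel_size=k,
                     padding=k // 2, groups=channels, bias=False)

# Spatio-temporal depthwise conv: one k x k x k filter per channel.
spatiotemporal = nn.Conv3d(channels, channels, kernel_size=k,
                           padding=k // 2, groups=channels, bias=False)


def num_params(module: nn.Module) -> int:
    """Count the learnable parameters of a module."""
    return sum(p.numel() for p in module.parameters())


x = torch.randn(1, channels, frames, height, width)  # (B, C, T, H, W)
b, c, t, h, w = x.shape

# For conv1d, fold the spatial positions into the batch dimension so the
# convolution slides along the temporal axis only.
x_1d = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t)
y_1d = temporal(x_1d).reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)

# conv3d mixes information along T, H, and W at once.
y_3d = spatiotemporal(x)

temporal_params = num_params(temporal)              # channels * k
spatiotemporal_params = num_params(spatiotemporal)  # channels * k**3
print(temporal_params, spatiotemporal_params)
```

Both variants keep the (B, C, T, H, W) shape, but with a kernel size of 3 the depthwise conv3d holds k² = 9 times as many weights per channel as the conv1d.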

Jmh0527 commented 2 days ago

Actually, I think that when using dwconv, a 1x1 conv should be applied before it to capture information across channels. I recall a paper that used this approach, although I can't remember which one. In the current design, the 1x1 conv is applied after the dwconv. What do you think?

sming256 commented 2 days ago

To be honest, I guess the position of the 1x1 conv relative to the dwconv will have only a minor impact on performance. Still, you can easily ablate this in your experiments.
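A minimal sketch of how such an ablation could be set up in PyTorch is shown below. The helper `make_block` and its `pointwise_first` flag are hypothetical names introduced here, not part of OpenTAD, and the sizes are arbitrary. The real adapter may contain other layers (projections, activations) that this sketch leaves out.

```python
import torch
import torch.nn as nn


def make_block(channels: int, kernel_size: int = 3,
               pointwise_first: bool = False) -> nn.Sequential:
    """Hypothetical helper: build a depthwise + 1x1 conv pair in either order."""
    # Depthwise temporal conv: each channel is filtered independently.
    depthwise = nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2, groups=channels)
    # Pointwise (1x1) conv: mixes channels at each time step.
    pointwise = nn.Conv1d(channels, channels, kernel_size=1)
    layers = [pointwise, depthwise] if pointwise_first else [depthwise, pointwise]
    return nn.Sequential(*layers)


channels, frames = 8, 16  # arbitrary sizes for illustration
x = torch.randn(2, channels, frames)  # (B, C, T)

dw_then_pw = make_block(channels, pointwise_first=False)  # order described as the current design
pw_then_dw = make_block(channels, pointwise_first=True)   # order proposed in the question

out_a = dw_then_pw(x)
out_b = pw_then_dw(x)

params_a = sum(p.numel() for p in dw_then_pw.parameters())
params_b = sum(p.numel() for p in pw_then_dw.parameters())
print(out_a.shape, out_b.shape, params_a, params_b)
```

Both orderings have the same parameter count and output shape, so the comparison isolates the effect of ordering. Swapping the flag in an otherwise identical training config would be one way to run the ablation.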