sming256 / OpenTAD

OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.
Apache License 2.0

VisionTransformerAdapter vs VisionTransformerLadder #24

Closed: zivnachum closed this issue 5 days ago

zivnachum commented 6 days ago

Hi Shuming,

I used VisionTransformerAdapter on my own dataset, and the results are great! But when I use VisionTransformerLadder, there is a big drop in the results. Do you have any suggestions as to what might cause this issue?

Thanks :)

sming256 commented 6 days ago

Thanks for your question!

Compared to the standard adapter architecture, the ladder network saves more memory, but at the cost of a weaker visual representation. Therefore, VisionTransformerAdapter is expected to perform better than VisionTransformerLadder.

The advantage of the ladder network usually becomes visible only when the backbone is very large, e.g., 1B or even 6B parameters. In addition, the architecture design of the ladder network may still be improved for stronger performance.
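For intuition, here is a minimal PyTorch-style sketch (not OpenTAD's actual implementation; the class and argument names are made up for illustration) of why the two designs trade memory against representation quality:

```python
# Simplified sketch, not OpenTAD code. Names like Adapter and LadderSide are illustrative.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Small bottleneck inserted inside each (frozen) transformer block."""

    def __init__(self, dim, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        # Residual bottleneck: gradients still flow back through the frozen
        # backbone blocks, so activation memory is high, but the features are
        # refined in place, giving a stronger representation.
        return x + self.up(torch.relu(self.down(x)))


class LadderSide(nn.Module):
    """Lightweight side branch fed by the frozen backbone's intermediate features."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, side, frozen_feat):
        # The side path only reads detached backbone features, so no gradients
        # (and no backbone activation memory) are needed -- much cheaper, but the
        # resulting representation is typically weaker than in-place adaptation.
        return side + self.proj(frozen_feat.detach())
```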

Therefore, I recommend simply sticking with VisionTransformerAdapter, which is usually more effective.
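If your experiment config defines the backbone as a dict (the usual OpenTAD config style), switching back should only require changing the `type` field; the surrounding keys below are illustrative and may differ from your actual config:

```python
# Illustrative only -- keys other than `type` may not match your config file.
model = dict(
    backbone=dict(
        type="VisionTransformerAdapter",  # switch back from "VisionTransformerLadder"
        # ... keep the remaining backbone arguments unchanged ...
    ),
)
```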

zivnachum commented 5 days ago

Thanks!