Closed — zivnachum closed this issue 5 days ago
Thanks for your question!
Compared to the standard adapter architecture, the ladder network saves more memory, but at the cost of a poorer visual representation. Therefore, VisionTransformerAdapter is expected to perform better than VisionTransformerLadder.
The advantage of the ladder network usually only becomes visible when the backbone is very large, e.g., 1B or even 6B parameters. In the meantime, the architecture of the ladder network may also be improved for stronger performance.
Therefore, I recommend simply sticking with VisionTransformerAdapter, which is usually more effective.
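To make the trade-off concrete, here is a minimal PyTorch sketch of the two designs. These are toy classes, not the actual VisionTransformerAdapter/VisionTransformerLadder implementations: adapters are interleaved with the backbone, so backprop still traverses (and stores activations for) every backbone layer, while a ladder-style side network reads backbone features computed under `no_grad`, so no gradient ever flows through the backbone and its activation memory is not retained.

```python
import torch
import torch.nn as nn

# Toy frozen "backbone" standing in for a large vision transformer.
backbone = nn.Sequential(*[nn.Linear(64, 64) for _ in range(4)])
for p in backbone.parameters():
    p.requires_grad = False  # frozen in both designs

class AdapterTuned(nn.Module):
    """Adapter-style: small trainable modules interleaved with the backbone.
    Backprop must flow through the backbone blocks, so their intermediate
    activations are kept for the backward pass."""
    def __init__(self):
        super().__init__()
        self.adapters = nn.ModuleList(nn.Linear(64, 64) for _ in range(4))

    def forward(self, x):
        for block, adapter in zip(backbone, self.adapters):
            x = adapter(torch.relu(block(x)))
        return x

class LadderTuned(nn.Module):
    """Ladder-style: a lightweight side network taps intermediate backbone
    features computed under no_grad, so no gradient flows through the
    backbone and its activations need not be stored."""
    def __init__(self):
        super().__init__()
        self.side = nn.ModuleList(nn.Linear(64, 64) for _ in range(4))

    def forward(self, x):
        feats = []
        with torch.no_grad():  # backbone runs entirely gradient-free
            h = x
            for block in backbone:
                h = torch.relu(block(h))
                feats.append(h)
        s = torch.zeros_like(x)
        for f, layer in zip(feats, self.side):
            s = torch.relu(layer(f + s))  # only the side path is trainable
        return s
```

The memory saving comes precisely from the `no_grad` backbone pass; the cost, as noted above, is that the side network never refines the backbone's internal representation, only re-mixes its outputs.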
Thanks!
Hi Shuming,
I used VisionTransformerAdapter on my own dataset, and the results are great! But when I use VisionTransformerLadder, there's a big drop in the results. Do you have any suggestions as to what might cause this issue?
Thanks :)