Open BMEI1314 opened 2 years ago
The CLIP4Clip results reported in Table 1 are based on ViT-B/32, while LocVTP uses ViT-B/16 to initialize the visual encoder. Is this a typo in the paper?
We use ViT-B/16 as the visual encoder, following OA-Trans. This does indeed make the comparison with CLIP4Clip somewhat unfair.
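
For readers wondering why the backbone matters here, a rough sketch of the difference in patch granularity follows. It assumes a 224x224 input frame (the usual CLIP resolution, not stated in this thread), and `num_patch_tokens` is a hypothetical helper written for illustration only.

```python
def num_patch_tokens(image_size: int, patch_size: int) -> int:
    """Count the non-overlapping square patches a ViT cuts from a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Assumption: 224x224 frames, as commonly used with CLIP ViT-B models.
IMAGE_SIZE = 224

tokens_b32 = num_patch_tokens(IMAGE_SIZE, 32)  # ViT-B/32: 7 x 7 patches
tokens_b16 = num_patch_tokens(IMAGE_SIZE, 16)  # ViT-B/16: 14 x 14 patches

print(f"ViT-B/32: {tokens_b32} patch tokens per frame")
print(f"ViT-B/16: {tokens_b16} patch tokens per frame")
```

Under that assumption, ViT-B/16 sees four times as many patch tokens per frame as ViT-B/32, so it works at a finer spatial resolution and costs more compute. A like-for-like comparison would need both methods on the same backbone.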