Some questions about VITWrapper in network.py

gmberton / deep-visual-geo-localization-benchmark

Official code for CVPR 2022 (Oral) paper "Deep Visual Geo-localization Benchmark"

MIT License

Some questions about VITWrapper in network.py #15

Closed: Kaacoinnn closed this issue 1 year ago

Kaacoinnn commented 1 year ago

Hi there, I noticed the VitWrapper in network.py. Could you explain what VitWrapper is for? Why does the ViT backbone drop the class token (`self.vit_model(x).last_hidden_state[:, 1:, :]`) when it is connected to a NetVLAD or GeM aggregation layer? ^ ^

ga1i13o commented 1 year ago

Hello, it was simply an implementation choice that we had to make, since we were the first to apply transformers to geo-localization. The results in our paper were obtained with that configuration. The choice stems from the fact that ViT is trained to incorporate global information in the CLS token, whereas NetVLAD and GeM are designed to operate on local features from CNNs, which do not include a global feature. However, if you want, you can try keeping the CLS token even with those aggregations.
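For readers following along, here is a minimal sketch of what that slicing does. It is not the code from network.py: a random tensor stands in for `self.vit_model(x).last_hidden_state`, and the shapes assume a ViT-B/16 at 224x224 input (196 patch tokens plus one CLS token, each of dimension 768). The variable names and the optional reshape into a CNN-style grid are illustrative assumptions.

```python
import torch

# Assumed shapes for ViT-B/16 at 224x224: 14 * 14 = 196 patches, token dim 768.
batch_size, num_patches, dim = 2, 196, 768

# Stand-in for self.vit_model(x).last_hidden_state:
# index 0 along the token axis is the CLS token, the rest are patch tokens.
last_hidden_state = torch.randn(batch_size, 1 + num_patches, dim)

cls_token = last_hidden_state[:, 0, :]      # (B, D): one global summary vector
patch_tokens = last_hidden_state[:, 1:, :]  # (B, N, D): local features only

# Optional (illustrative): view the patch tokens as a CNN-style feature map,
# which is the kind of input NetVLAD and GeM were originally designed for.
side = int(num_patches ** 0.5)              # 14 for 196 patches
feature_map = patch_tokens.transpose(1, 2).reshape(batch_size, dim, side, side)

print(cls_token.shape, patch_tokens.shape, feature_map.shape)
```

Slicing from index 1 keeps only the per-patch descriptors, so the aggregation layer sees something analogous to the spatial locations of a CNN feature map.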

Kaacoinnn commented 1 year ago

Thanks for replying! I'll try keeping the CLS token with NetVLAD.

What confuses me is whether the dimension of the ViT output will match the input dimension of NetVLAD.

ga1i13o commented 1 year ago

In the same way that NetVLAD with a CNN can accept feature maps of any size, it can accept any number of tokens with transformers. It depends only on the dimension of each token, so you should not have any problems.
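To make this concrete, below is a simplified NetVLAD-style pooling over a token sequence. It is only a sketch under stated assumptions, not the NetVLAD layer used in this repository: the class name `TokenNetVLAD`, the linear soft-assignment, and the randomly initialized centroids are all hypothetical choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenNetVLAD(nn.Module):
    """Simplified NetVLAD-style pooling over (B, N, D) tokens (hypothetical sketch)."""

    def __init__(self, dim, num_clusters=64):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)  # soft-assignment logits per token
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, tokens):
        tokens = F.normalize(tokens, dim=-1)             # (B, N, D)
        soft = F.softmax(self.assign(tokens), dim=-1)    # (B, N, K)
        # Sum over tokens of soft[b, n, k] * (tokens[b, n] - centroids[k]).
        # The token axis N is summed out, so N never appears in the output shape.
        weighted_sum = torch.einsum("bnk,bnd->bkd", soft, tokens)
        vlad = weighted_sum - soft.sum(dim=1).unsqueeze(-1) * self.centroids
        vlad = F.normalize(vlad, dim=-1)                 # per-cluster normalization
        return F.normalize(vlad.flatten(1), dim=-1)      # (B, K * D)


pool = TokenNetVLAD(dim=768, num_clusters=64)
without_cls = pool(torch.randn(2, 196, 768))  # patch tokens only
with_cls = pool(torch.randn(2, 197, 768))     # CLS token kept
print(without_cls.shape, with_cls.shape)      # both have K * D = 64 * 768 columns
```

Only `dim` has to match the token dimension of the backbone. Keeping or dropping the CLS token changes the number of tokens, which is summed out during aggregation.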