Closed. Kaacoinnn closed this issue 1 year ago.
Hello, it was simply an implementation choice that we had to make, since we were the first to apply transformers to geolocalization. The results in our paper were obtained with that configuration. The choice stems from the fact that ViT is trained to incorporate global information in the CLS token, whereas NetVLAD and GeM are designed to operate on local features from CNNs, and thus without global features. However, if you want, you can try keeping the CLS token even with those aggregations.
Thanks for replying!! I'll try keeping the CLS token with NetVLAD~~
The confusing part is whether the dimension of the ViT output might not match the input dimension of NetVLAD.
In the same way that NetVLAD with a CNN can accept feature maps of any size, it can accept any number of tokens with transformers. It only depends on the dimension of each token, so you should not have any problems.
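To illustrate the point about token count, here is a minimal sketch of my own, not the repository's code. It uses GeM rather than NetVLAD for brevity, and NumPy arrays as stand-ins for the ViT output. The function name `gem_pool`, the token dimension 768, and the token counts are all assumptions chosen for illustration. The same property should hold for NetVLAD, since its cluster centroids are sized by the token dimension and not by the number of tokens.

```python
import numpy as np

def gem_pool(tokens, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the token axis.

    tokens: array of shape (batch, num_tokens, dim).
    Returns an array of shape (batch, dim), whatever num_tokens is.
    """
    clamped = np.clip(tokens, eps, None)  # GeM expects positive inputs
    return np.mean(clamped ** p, axis=1) ** (1.0 / p)

rng = np.random.default_rng(0)
dim = 768  # assumed token dimension, illustrative only

# Two hypothetical inputs with different numbers of tokens, same token dimension.
few_tokens = rng.random((2, 196, dim))
many_tokens = rng.random((2, 576, dim))

desc_few = gem_pool(few_tokens)
desc_many = gem_pool(many_tokens)
print(desc_few.shape, desc_many.shape)  # (2, 768) (2, 768)
```

The descriptor size follows the token dimension only, so changing the input resolution (and hence the number of tokens) does not change the shape the aggregation layer produces.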
Hi there, I noticed the VitWrapper in network.py. Could you please explain the purpose of VitWrapper? Why should the ViT backbone drop the class token with
self.vit_model(x).last_hidden_state[:, 1:, :]
when it is connected to a NetVLAD or GeM aggregation layer? ^ ^
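For anyone following along, here is a small sketch of what that slice does. It is not the repository's code: a random NumPy array stands in for `last_hidden_state`, the sizes are assumed, and it assumes the CLS token sits at index 0 of the token axis, as the `1:` slice suggests.

```python
import numpy as np

# Assumed sizes for illustration: 196 patch tokens of dimension 768.
batch, num_patches, dim = 2, 196, 768

# Stand-in for self.vit_model(x).last_hidden_state:
# shape (batch, 1 + num_patches, dim), with the CLS token assumed at index 0.
rng = np.random.default_rng(0)
last_hidden_state = rng.random((batch, 1 + num_patches, dim))

cls_token = last_hidden_state[:, 0, :]      # global token only
patch_tokens = last_hidden_state[:, 1:, :]  # CLS dropped, local tokens only
all_tokens = last_hidden_state              # alternative: keep CLS as one more token

print(cls_token.shape)     # (2, 768)
print(patch_tokens.shape)  # (2, 196, 768)
print(all_tokens.shape)    # (2, 197, 768)
```

Either `patch_tokens` or `all_tokens` can be passed to an aggregation layer that pools over the token axis, since the two differ only in the number of tokens, not in the token dimension.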