Closed StOnEGiggity closed 2 years ago
Hi,
Thanks for raising this discussion! The hubness phenomenon appears in retrieval models, and since performance is measured on the testing split, hubness matters most with respect to that split.
With the querybank, we are trying to estimate which videos are hubs, but to get an improvement in performance we need to identify the potential hubs relative to the testing split. So, if there is a domain gap between the testing split and the querybank, it may lead to performance degradation depending on the normalization method used. We cover several cases in Table 2 of the paper: depending on the source of the queries in the querybank, there can be a degradation in performance, so the nature of the querybank clearly has an impact. However, we obtained a boost in performance on all datasets when we used all the queries from training.
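As a rough illustration of what "identifying potential hubs" means here (this is my own simplification, not the repository code), you can count how often each gallery item lands in the top-k retrievals across the querybank queries; items with unusually high counts behave as hubs:

```python
import numpy as np

def count_hub_hits(bank_sims: np.ndarray, k: int = 1) -> np.ndarray:
    """Count how often each gallery item appears in the top-k
    retrievals over all querybank queries.

    bank_sims: (num_bank_queries, num_gallery) similarity matrix.
    Returns a length-num_gallery array of hit counts; gallery items
    with unusually high counts act as hubs.
    """
    topk = np.argsort(-bank_sims, axis=1)[:, :k]  # top-k gallery ids per query
    return np.bincount(topk.ravel(), minlength=bank_sims.shape[1])

# Toy example: gallery item 2 is closest to every bank query, so it is a hub.
sims = np.array([[0.1, 0.2, 0.9],
                 [0.0, 0.3, 0.8],
                 [0.2, 0.1, 0.7]])
print(count_hub_hits(sims, k=1))  # item 2 collects all three top-1 hits
```

Whether those counts transfer to the testing split is exactly where the domain gap shows up.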
I assume this domain gap appears in your case, since using validation queries works but using the training queries doesn't.
What happens if you use more/fewer queries from training? Are the querybank sizes comparable between training and validation? What happens if you combine validation and training? In our experience, it helps to have as many queries as possible in the querybank in order to mitigate this domain shift.
Also, the beta parameter can influence performance, so it may be worth validating it on a separate split. CLIP4Clip scales the final similarity matrix (not sure if this is the case for you as well), so try using smaller beta values (in the range 0 to 2).
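To make the role of beta concrete, here is a minimal inverted-softmax-style sketch (my own simplification, not the repository code): beta is the inverse temperature applied before each gallery column is normalized by the summed querybank activations, which down-weights gallery items that are close to many bank queries.

```python
import numpy as np

def inverted_softmax(test_sims: np.ndarray,
                     bank_sims: np.ndarray,
                     beta: float = 1.5) -> np.ndarray:
    """Normalize test query-gallery similarities by querybank activations.

    test_sims: (num_test_queries, num_gallery)
    bank_sims: (num_bank_queries, num_gallery)
    Larger beta sharpens the normalizer, penalizing hub columns harder.
    """
    # per-gallery-item normalizer computed from the querybank
    normalizer = np.exp(beta * bank_sims).sum(axis=0)  # (num_gallery,)
    return np.exp(beta * test_sims) / normalizer

# Toy case: column 1 is a hub (close to both bank queries), so after
# normalization the test query's argmax flips away from it.
test = np.array([[0.5, 0.6]])
bank = np.array([[0.1, 0.9],
                 [0.2, 0.8]])
normed = inverted_softmax(test, bank, beta=1.5)
assert normed[0, 0] > normed[0, 1]
```

If your model already scales similarities (as CLIP4Clip does), that scaling compounds with beta, which is why smaller beta values can behave better.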
P.S. You can also find a more detailed discussion about hubness and QB-Norm here https://www.youtube.com/watch?v=KSgmKIbyA8M&t=3s.
Thanks!
Hi,
Thanks for your explanation. I will try the above suggestions as you said.
I believe a domain gap exists between the train and test sets, but it may not be the main reason, since we both use a similar architecture based on CLIP4Clip. I notice that CLIP2Video uses NetVLAD, a clustering method, while QB-Norm uses CLIP4Clip with a temporal transformer that includes self-attention modules.
My experiments use mean pooling to extract video-level features, without extra clustering methods or attention mechanisms. Is there any connection between the hubness problem and specific architecture designs (e.g., dot-product similarity vs. clustering)? I will try QB-Norm with the related designs.
Thanks for your advice and explanation again :)
Hi,
Thanks for your open-source work.
I tried QB-Norm with my custom model, which is based on CLIP4Clip. However, I find there is performance degradation when I use the 9K training split as the query bank, while I obtain a performance gain when using the Val split as the query bank.
I guess the assumption behind the hubness problem in Figure 1 may need more discussion. In my experiments, the hub X2 differs between the Val query bank and the training query bank. Maybe this is why the training query bank does not work.
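One way I could quantify that observation (a hypothetical sketch with random stand-in similarity matrices, assuming both banks score the same gallery) is to compare the hub sets that each querybank produces:

```python
import numpy as np

def top_hubs(bank_sims: np.ndarray, num_hubs: int = 5, k: int = 1) -> set:
    """Return the gallery indices most often retrieved at top-k by the bank."""
    topk = np.argsort(-bank_sims, axis=1)[:, :k]
    counts = np.bincount(topk.ravel(), minlength=bank_sims.shape[1])
    return set(np.argsort(-counts)[:num_hubs].tolist())

# Hypothetical similarity matrices over the same 50-item gallery.
rng = np.random.default_rng(0)
train_bank_sims = rng.random((100, 50))
val_bank_sims = rng.random((40, 50))

train_hubs = top_hubs(train_bank_sims)
val_hubs = top_hubs(val_bank_sims)
overlap = len(train_hubs & val_hubs) / 5
print(f"hub overlap: {overlap:.0%}")  # low overlap means the banks disagree on hubs
```

If the overlap between the two banks' hub sets is low, that would support the idea that the training bank is normalizing against the wrong hubs.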
I am unfamiliar with the hubness problem and look forward to a detailed discussion or suggestions. If I have misunderstood the assumption, please tell me directly :) Thanks a lot.