Closed tingxueronghua closed 2 years ago
Hi @tingxueronghua, thanks for your interest in our work! See this issue regarding ViT-B/16. The number of layers we consider determines how far the context of the tokens from the last layers is expanded. For CLIP with ViT-B/32 this context expansion has little to no negative effect, but for ViT-B/16 people noticed that the expanded context causes artifacts to be highlighted more strongly, so I added support for controlling the level of context expansion. See the discussion in that issue for further details.
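If it helps to make the role of the layer count concrete, here is a minimal sketch of gradient-weighted attention rollout restricted to the last `num_layers` blocks. The names `attn_maps`, `attn_grads`, and `relevance_rollout` are illustrative, not the repo's exact API; I'm assuming the per-block attention maps and their gradients have already been cached during a backward pass:

```python
import torch

def relevance_rollout(attn_maps, attn_grads, num_layers):
    """Accumulate relevance over only the last `num_layers` attention blocks.

    attn_maps / attn_grads: lists of [heads, tokens, tokens] tensors,
    one per transformer block, ordered from the first block to the last.
    """
    num_tokens = attn_maps[0].shape[-1]
    R = torch.eye(num_tokens)                            # start from identity: each token explains itself
    for attn, grad in zip(attn_maps[-num_layers:], attn_grads[-num_layers:]):
        cam = (grad * attn).clamp(min=0).mean(dim=0)     # gradient-weighted attention, averaged over heads
        R = R + cam @ R                                  # expand each token's context by one more block
    return R[0, 1:]                                      # CLS row, image patches only
```

With `num_layers=1` only the final block's attention shapes the map; as you increase it, relevance is mixed through more and more blocks, which is the "context expansion" that tends to amplify artifacts for ViT-B/16.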
I hope I was able to answer your question, but feel free to ask for clarifications if you need them.
Best, Hila.
Thanks for your explanation! Sorry to bother you, but I still have another question. Given "num_layers", are we assuming that the i-th token of an intermediate feature corresponds to the i-th token of the input? I have seen other visualization libraries consider intermediate layers as well, but I am not sure whether this is just an assumption.
Hi @tingxueronghua, feel free to ask anything :) If I understand your question correctly, you are indeed right. We assume that the i-th row of the attention matrix corresponds to the i-th token in the original sequence. When we do not propagate relevance all the way back to the first attention layer, we indeed assume that the i-th intermediate representation still corresponds to the i-th input token.
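To make that assumption concrete: when the propagation stops early, the relevance of token i is simply painted onto input patch i. A minimal sketch of that mapping (the function name and defaults are illustrative, not from the repo):

```python
import torch

def relevance_to_heatmap(patch_relevance, image_size=224, patch_size=16):
    # Assume token i at any depth still corresponds to input patch i,
    # so index i maps to grid position (i // side, i % side).
    side = image_size // patch_size                      # e.g. 14x14 patches for ViT-B/16
    heatmap = patch_relevance.reshape(side, side)
    heatmap = torch.nn.functional.interpolate(
        heatmap[None, None], size=(image_size, image_size),
        mode="bilinear", align_corners=False)[0, 0]      # upsample to image resolution
    return heatmap
```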
Thank you very much! I will close this issue since all my questions have been answered.
I do not understand why the CLIP visualization only uses the last two layers (because of num_layers). Could you share some insights with us?
The "num_layers" variable makes the CLIP heat maps clearer, but I think the visualizations of CLIP and ViT should be similar, or at least comparable, because they share the same architecture.