microsoft / Cream

This is a collection of our NAS and Vision Transformer work.
MIT License

question about Fig 4 in EfficientViT #178

Closed: ysj9909 closed this issue 1 year ago

ysj9909 commented 1 year ago

Thanks for sharing the code for your work. In Figure 4, could you explain how the similarity between heads was calculated?

Thank you!

xinyuliu-jeffrey commented 1 year ago

Hi @ysj9909,

Thanks for your interest in our work. For each layer, the maximum cosine similarity is max_i max_{j≠i} CosineSim(feat_i, feat_j), where feat_i is the feature of the i-th head. This value is then averaged over all batches.

Best, Xinyu
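The reply above defines the per-layer score in words. What follows is a minimal sketch of that computation, not the authors' code. It rests on several assumptions. Each head's output feature has already been flattened into one vector per batch. The helper names `cosine_sim`, `max_head_similarity` and `layer_score` are hypothetical. Pure Python is used so the example runs without extra dependencies.

```python
import math
from typing import List, Sequence


def cosine_sim(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (small epsilon avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def max_head_similarity(head_feats: List[List[float]]) -> float:
    """max_i max_{j != i} CosineSim(feat_i, feat_j) for one layer and one batch.

    Assumption: head_feats[i] is the flattened output feature of the i-th head.
    """
    n = len(head_feats)
    if n < 2:
        raise ValueError("need at least two heads to compare")
    return max(
        cosine_sim(head_feats[i], head_feats[j])
        for i in range(n)
        for j in range(n)
        if j != i  # skip self-similarity, which is always 1
    )


def layer_score(batches: List[List[List[float]]]) -> float:
    """Average the per-batch maximum head similarity over all batches."""
    return sum(max_head_similarity(b) for b in batches) / len(batches)


# Usage: one layer with 3 heads, evaluated on 2 toy batches.
batches = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],  # max pairwise cosine is 1/sqrt(2)
    [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]],  # heads 0 and 1 are parallel, cosine 1
]
print(layer_score(batches))
```

Here the maximum is taken per batch and then averaged, which follows the order described in the reply. With real models, the same logic would typically be vectorized over a tensor of per-head features.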