ysj9909 closed this issue 1 year ago
Hi @ysj9909,
Thanks for your interest in our work. For each layer, the maximum cosine similarity is max_i(max_j(CosineSim(feat(i-th head), feat(j-th head)))) with j != i, i.e. the largest cosine similarity between the features of any two distinct heads. This value is then averaged over all batches.
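For readers who want to reproduce the number, here is a minimal NumPy sketch of the computation described above. It is not taken from the repository: the function name `max_head_similarity`, the `(batch, heads, dim)` layout (each head's feature flattened to one vector), and averaging per sample rather than per batch are all assumptions made for illustration.

```python
import numpy as np

def max_head_similarity(feats):
    """Mean over samples of the max pairwise cosine similarity between heads.

    feats: array of shape (batch, heads, dim), with heads >= 2.
    This layout is an assumption, not the repository's actual format.
    """
    feats = np.asarray(feats, dtype=np.float64)
    # Normalize each head's feature vector to unit length.
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    unit = feats / np.clip(norms, 1e-12, None)
    # Pairwise cosine similarities between heads: (batch, heads, heads).
    sim = unit @ unit.transpose(0, 2, 1)
    # Exclude the j == i entries, which are trivially 1.
    idx = np.arange(sim.shape[-1])
    sim[:, idx, idx] = -np.inf
    # max_i max_j (j != i) for each sample, then average over samples.
    return sim.max(axis=(1, 2)).mean()
```

Calling this once per layer on that layer's per-head features would give one value per layer, which is how the description above reads.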
Best, Xinyu
Thanks for sharing the code for your work. For Figure 4, could you explain how the similarity between heads was calculated?
Thank you!