Question About Similarity Formulation in Token Merge

rese1f / MovieChat

[CVPR 2024] 🎬💭 chat with over 10K frames of video!

https://rese1f.github.io/MovieChat/

BSD 3-Clause "New" or "Revised" License

454 stars, 37 forks

Question About Similarity Formulation in Token Merge #34

Closed: zhang9302002 closed this issue 6 months ago

zhang9302002 commented 6 months ago

Hello, in your paper the similarity is calculated as: $$s=\frac{1}{N}\sum_{j=1}^{N} \left[\cos (x_i^j, x_{i+1}^j)\right]$$

However, according to this code, the similarity is actually calculated as: $$s=\frac{1}{N^2}\sum_{j=1}^{N} \sum_{k=1}^{N} \left[\cos (x_i^j, x_{i+1}^k)\right]$$

This confuses me. Which one is correct, or am I missing something? Thank you.

Espere-1119-Song commented 6 months ago

$i$ is the frame index and $j$ is the token index; the goal is to compute the average similarity between corresponding tokens of two frames. The similarity computation itself follows the ToMe code.

zhang9302002 commented 6 months ago

Your code calculates the average similarity over all token pairs $(x^j, y^k)$ between two frames, instead of only the corresponding pairs $(x^j, y^j)$. I think this contradicts the paper.
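To make the difference concrete, here is a minimal sketch in plain Python. It is not the repository's code: the function names `matched_similarity` and `all_pairs_similarity` and the toy frames are made up for illustration, and each frame is assumed to be a list of $N$ token vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matched_similarity(frame_a, frame_b):
    """Paper formula: average cosine over tokens sharing the same index j."""
    n = len(frame_a)
    return sum(cosine(frame_a[j], frame_b[j]) for j in range(n)) / n

def all_pairs_similarity(frame_a, frame_b):
    """All-pairs formula: average cosine over all N * N pairs (j, k)."""
    n = len(frame_a)
    total = sum(cosine(x, y) for x in frame_a for y in frame_b)
    return total / (n * n)

# Hypothetical toy frames: N = 2 tokens of dimension 2, and the two
# frames are identical, so a frame-level similarity of 1 is expected.
frame_i = [[1.0, 0.0], [0.0, 1.0]]
frame_next = [[1.0, 0.0], [0.0, 1.0]]

print(matched_similarity(frame_i, frame_next))    # 1.0
print(all_pairs_similarity(frame_i, frame_next))  # 0.5
```

Even for two identical frames, the all-pairs average drops to 0.5 in this example because the orthogonal cross pairs $(j \neq k)$ are included, while the matched version gives 1.0.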

Espere-1119-Song commented 6 months ago

Thanks for your valuable advice; we have fixed the bug in our code. The current code also gives the same results as our paper.