hustvl / Vim

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

About the position of the cls token #58

Open ericzw opened 2 months ago

ericzw commented 2 months ago

Why is the middle cls token the best? If that's the case, it seems the cls token only captures half the tokens of the sequence, because the SSM processes tokens in order while the Transformer does not.
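
For context, a minimal sketch (not the repo's code; `insert_cls`, the placement names, and the shapes are illustrative) of the three cls-token placements, and why, under a single causal scan, a middle cls token would only see the first half of the patches:

```python
import torch

def insert_cls(patches: torch.Tensor, cls: torch.Tensor, position: str) -> torch.Tensor:
    """Insert a cls token into a (L, D) patch sequence.

    position: 'head' puts cls first, 'middle' puts it at L // 2,
    'double' puts one copy at each end. (Hypothetical helper, not the Vim API.)
    """
    L = patches.shape[0]
    if position == "head":
        return torch.cat([cls, patches], dim=0)
    if position == "middle":
        return torch.cat([patches[: L // 2], cls, patches[L // 2 :]], dim=0)
    if position == "double":
        return torch.cat([cls, patches, cls], dim=0)
    raise ValueError(position)

L, D = 8, 4
patches = torch.randn(L, D)
cls = torch.zeros(1, D)

seq = insert_cls(patches, cls, "middle")
cls_idx = L // 2
# Under a single forward (causal) scan, the hidden state at cls_idx only
# depends on tokens 0..cls_idx, i.e. roughly half of the patches.
print(f"forward scan: cls at index {cls_idx} sees tokens 0..{cls_idx}")
```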

zhuqiangLu commented 2 months ago

I believe this is due to the bidirectional patch traversal that is used. However, I do agree that putting the cls token in the middle of the sequence does not sound optimal.
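
A rough sketch of the bidirectional traversal idea, assuming it amounts to running the same causal block over the sequence and over its reverse and summing the outputs (`ToyCausalSSM` and `BidirectionalScan` are placeholder names standing in for the actual Mamba block, not the Vim API):

```python
import torch
import torch.nn as nn

class ToyCausalSSM(nn.Module):
    """Stand-in for a causal SSM block: a simple cumulative average,
    so the output at position t depends only on tokens 0..t."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        counts = torch.arange(1, x.shape[1] + 1, device=x.device).view(1, -1, 1)
        return x.cumsum(dim=1) / counts

class BidirectionalScan(nn.Module):
    def __init__(self, ssm_block: nn.Module):
        super().__init__()
        self.ssm = ssm_block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D). The forward pass covers tokens 0..t at position t;
        # the reversed pass covers tokens t..L-1. Combined, every position,
        # including a middle cls token, receives information from the
        # whole sequence.
        fwd = self.ssm(x)
        bwd = self.ssm(x.flip(dims=[1])).flip(dims=[1])
        return fwd + bwd

x = torch.randn(2, 9, 4)  # batch of 2, length 9 (e.g. 8 patches + middle cls)
out = BidirectionalScan(ToyCausalSSM())(x)
print(out.shape)  # torch.Size([2, 9, 4])
```

Under this reading, the middle position is the one spot where the forward and backward scans each contribute roughly half of the patches, so the cls token ends up with full, balanced coverage despite each individual scan being causal.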

poult-lab commented 2 months ago

Do you know why the authors introduced these concepts (head class token, double class token, middle class token)?