Open ericzw opened 2 months ago
I believe this is because bidirectional patch traversal is used. However, I do agree that putting the cls token in the middle of the sequence does not sound optimal.
Do you know why the author introduced these concepts (head class token, double class token, middle class token)?
Why is the middle cls token the best? If it is, it seems that the cls token only captures half of the tokens in the sequence, because an SSM processes tokens in order while a Transformer does not.
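
To make the "half of the sequence" concern concrete, here is a toy sketch, not the model's actual code. It assumes a scalar linear recurrence `h_t = decay * h_{t-1} + x_t` as a stand-in for the SSM scan, and it assumes the forward and backward scans are simply summed. The names `scan` and `influencing_positions` are hypothetical. It probes which input positions can change the output at the cls position.

```python
def scan(xs, decay=0.5):
    """Toy causal scan: h_t = decay * h_{t-1} + x_t (stand-in for an SSM)."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + x
        out.append(h)
    return out


def influencing_positions(seq_len, cls_index, bidirectional):
    """Return the input positions that change the output at cls_index."""

    def output_at_cls(xs):
        forward = scan(xs)
        if not bidirectional:
            return forward[cls_index]
        # Backward scan: run the same recurrence on the reversed sequence,
        # then flip the result back to the original order.
        backward = scan(xs[::-1])[::-1]
        # Assumption: the two directions are combined by a plain sum.
        return forward[cls_index] + backward[cls_index]

    reference = output_at_cls([0.0] * seq_len)
    seen = set()
    for i in range(seq_len):
        probe = [0.0] * seq_len
        probe[i] = 1.0  # perturb a single input position
        if output_at_cls(probe) != reference:
            seen.add(i)
    return seen


if __name__ == "__main__":
    seq_len, middle = 9, 4
    print("forward only :", sorted(influencing_positions(seq_len, middle, False)))
    print("bidirectional:", sorted(influencing_positions(seq_len, middle, True)))
```

In this toy setting, a middle token under a forward-only scan is reached only by the positions up to and including itself, which matches the concern above. With the backward scan added, every position reaches it. Whether that is the reason the middle position works best in the real model is not something this sketch can show.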