Closed: trpstra closed this issue 3 years ago
@onclue hey! So there is an emerging practice where you don't attach the [CLS] token at the beginning of the sequence, but instead do attention pooling with a [CLS] token across the sequence at the very end:
https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/cait.py#L175 https://github.com/lucidrains/perceiver-pytorch/blob/main/perceiver_pytorch/perceiver_io.py#L179
Highly recommend that approach to overcome your problem!
And yes, you are right, you can't use a [CLS] token in that manner for this specific architecture.
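For anyone landing here later, a minimal sketch of that kind of attention pooling, assuming PyTorch. This is not the actual code from the linked files; the module name `AttentionPool` and the `cls_query` parameter are illustrative. The idea is the same as CaiT's class attention and Perceiver IO's output queries: a single learned query cross-attends over the encoder output once, after the backbone, rather than a [CLS] token being prepended to the input.

```python
import torch
from torch import nn

class AttentionPool(nn.Module):
    """Pool a sequence into one vector via a learned [CLS]-style query (sketch)."""
    def __init__(self, dim, heads=8):
        super().__init__()
        # learned query that plays the role of the [CLS] token, applied at the end
        self.cls_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim) — the encoder's output hidden states
        q = self.cls_query.expand(tokens.shape[0], -1, -1)   # (batch, 1, dim)
        pooled, _ = self.attn(q, tokens, tokens)              # cls query attends over all tokens
        return self.norm(pooled.squeeze(1))                   # (batch, dim)

# usage: pooled = AttentionPool(dim)(encoder_output); logits = classifier_head(pooled)
```

Since the pooling query is outside the backbone, it never has to participate in the hierarchical attention pattern, which sidesteps the issue with a prepended [CLS] token.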
Thanks a lot. I was not familiar with this approach but it seems to make sense. I will try it out. Cheers!
@onclue Have you found a solution to this issue? Thanks a lot.
Hi,
Forgive the naive question, I am trying to make sense of this paper but it's tough going. If I understand correctly, this attention mechanism focuses mainly on nearby tokens and only attends to distant tokens via a hierarchical, low-rank approximation. In that case, can the usual sequence classification approach of having a global [CLS] token that can attend to all other tokens (and vice versa) still work? If not, how can this attention mechanism handle the text classification tasks in the long range arena benchmark?
Cheers for whatever insights you can share, and thanks for the great work!