lucidrains / h-transformer-1d

Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning
MIT License

Application to sequence classification? #12

Closed · trpstra closed this issue 3 years ago

trpstra commented 3 years ago

Hi,

Forgive the naive question; I am trying to make sense of this paper, but it's tough going. If I understand correctly, this attention mechanism focuses mainly on nearby tokens and only attends to distant tokens via a hierarchical, low-rank approximation. In that case, can the usual sequence classification approach of having a global [CLS] token that can attend to all other tokens (and vice versa) still work? If not, how can this attention mechanism handle the text classification tasks in the Long Range Arena benchmark?

Cheers for whatever insights you can share, and thanks for the great work!

lucidrains commented 3 years ago

@onclue hey! So there is an emerging practice where you don't attach the [CLS] token at the beginning, but instead use attention pooling with a [CLS] token over the sequence at the very end:

https://github.com/lucidrains/vit-pytorch/blob/main/vit_pytorch/cait.py#L175
https://github.com/lucidrains/perceiver-pytorch/blob/main/perceiver_pytorch/perceiver_io.py#L179

Highly recommend that approach to overcome your problem!
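For concreteness, here is a minimal sketch of that attention-pooling head in plain PyTorch; the `AttentionPool` module, the dimension choices, and the `encoder` call are illustrative stand-ins, not this repo's API.

```python
import torch
from torch import nn

# Minimal attention-pooling head: a single learned [CLS] query cross-attends
# over the encoder's token embeddings once, at the very end, instead of a
# [CLS] token being prepended to the input sequence.
class AttentionPool(nn.Module):
    def __init__(self, dim, heads = 8, num_classes = 2):
        super().__init__()
        self.cls_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.to_logits = nn.Linear(dim, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq, dim)
        batch = tokens.shape[0]
        q = self.cls_query.expand(batch, -1, -1)    # (batch, 1, dim)
        pooled, _ = self.attn(q, tokens, tokens)    # cross-attention, query = [CLS]
        return self.to_logits(pooled.squeeze(1))    # (batch, num_classes)

# usage sketch: `encoder` stands in for any model that returns per-token
# embeddings of shape (batch, seq, dim), e.g. an H-Transformer-1D encoder
# tokens = encoder(input_ids)                       # hypothetical call
# logits = AttentionPool(dim = 512, num_classes = 2)(tokens)
```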

lucidrains commented 3 years ago

And yes, you are right: you can't use a [CLS] token in that manner for this specific architecture.

trpstra commented 3 years ago

Thanks a lot. I was not familiar with this approach but it seems to make sense. I will try it out. Cheers!

junyongyou commented 3 years ago

@onclue Have you found a solution to this issue? Thanks a lot.