google-research / bigbird

Transformers for Longer Sequences
https://arxiv.org/abs/2007.14062
Apache License 2.0

Differences between ETC and BigBird-ETC version #26

Open lhl2017 opened 2 years ago

lhl2017 commented 2 years ago

@manzilz Thank you for sharing the excellent research. :)

I have two quick questions. If I missed some information in your paper, could you please point me to it?

Q1. Is the global-local attention mechanism used in the BigBird-ETC version exactly the same as in the ETC paper, or is it closer to Longformer's?
As far as I understand the ETC paper, special (global) tokens attend fully only to a restricted part of the sequence. For example, in the HotpotQA task a paragraph token attends to all tokens within its paragraph, and a sentence token attends to all tokens within its sentence. (I couldn't find what the [CLS] and question tokens attend to.)

In Longformer, by contrast, the special tokens inserted between sentences attend fully to the whole context.

In the BigBird paper (just above Section 3), the authors say:

"we add g global tokens that attend to all existing tokens."

This seems to suggest that the BigBird-ETC version is similar to Longformer. However, when the authors discuss the differences between Longformer and BigBird-ETC (in Appendix E.3), they cite ETC as the reference, which confuses me.
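
To make concrete which two patterns I am asking about, here is a minimal numpy sketch of my current understanding. The token layout, segment boundaries, and global-token positions are made up for illustration; this is not the actual ETC or BigBird implementation.

```python
# Minimal sketch (my understanding only) of the two global-attention patterns.
import numpy as np

seq_len = 8                        # toy sequence: [G0, t1, t2, t3, G1, t5, t6, t7]
global_tokens = [0, 4]             # hypothetical positions of two global tokens
segments = {0: range(0, 4),        # hypothetical: G0 summarises tokens 0..3
            4: range(4, 8)}        # hypothetical: G1 summarises tokens 4..7

# ETC-style (restricted) global attention: a global token attends only to the
# tokens of its own segment, e.g. a sentence token attends within its sentence.
etc_mask = np.zeros((seq_len, seq_len), dtype=bool)
for g, span in segments.items():
    etc_mask[g, list(span)] = True     # global token -> its own segment
    etc_mask[list(span), g] = True     # segment tokens -> their global token

# Longformer-style (and, as quoted above Section 3, seemingly BigBird-ETC-style)
# full global attention: every global token attends to, and is attended by,
# all tokens in the sequence.
full_mask = np.zeros((seq_len, seq_len), dtype=bool)
full_mask[global_tokens, :] = True     # global tokens -> everything
full_mask[:, global_tokens] = True     # everything -> global tokens

print("ETC-style restricted global mask:\n", etc_mask.astype(int))
print("Full global mask:\n", full_mask.astype(int))
```

My question is essentially which of these two masks the BigBird-ETC global tokens use.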

Q2. Is there source code or a pre-trained model available for the BigBird-ETC version? If you could share what was used in the paper, I would really appreciate it!

I look forward to your response.