google-research / long-range-arena

Long Range Arena for Benchmarking Efficient Transformers
Apache License 2.0

Quadratic Longformer suspicion #44

Closed YegorKhodak closed 2 years ago

YegorKhodak commented 2 years ago

Hi! I am now looking at the Longformer implementation, and it seems that nn.attention.dot_product_attention() (with the attention pattern passed through the bias parameter) does all the heavy lifting. But in nn.attention.dot_product_attention(), as in the newer linen.attention.dot_product_attention(), the query-key multiplication happens first and the mask is only applied afterwards, so the computation is still quadratic. Can you explain, please, how you bypass the quadratic computation?
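
For reference, here is a minimal pure-JAX sketch (not the repo's code; the helper names are made up) of the pattern described above: the full (seq_len, seq_len) score matrix is materialised first, and the Longformer-style sliding-window pattern is only applied afterwards as an additive bias, which is why the cost stays quadratic.

```python
import jax
import jax.numpy as jnp

def local_attention_mask(seq_len, window):
    # Longformer-style sliding-window pattern: query i may attend to
    # keys within +/- `window` positions. The mask itself is (seq_len, seq_len).
    idx = jnp.arange(seq_len)
    return jnp.abs(idx[:, None] - idx[None, :]) <= window

def masked_dot_product_attention(q, k, v, mask):
    # q, k, v: (seq_len, head_dim). The einsum below materialises the full
    # (seq_len, seq_len) score matrix -- this is the quadratic step -- and
    # the sparsity pattern is only added afterwards as a bias.
    scores = jnp.einsum('qd,kd->qk', q, k) / jnp.sqrt(q.shape[-1])
    bias = jnp.where(mask, 0.0, -1e9)
    weights = jax.nn.softmax(scores + bias, axis=-1)
    return weights @ v

key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(s, (128, 64)) for s in jax.random.split(key, 3))
out = masked_dot_product_attention(q, k, v, local_attention_mask(128, window=8))
print(out.shape)  # (128, 64)
```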

vanzytay commented 2 years ago

Hi there,

Yes, Longformer and Sparse Transformer in our codebase are still quadratic. To get the speed benefits of Longformer one would need custom CUDA kernels, which do not work well with TPUs, so we consider them inconvenient to use here. The purpose of this benchmark is to have a mathematically equivalent version so that we can benchmark model quality. Note that we do not benchmark the memory and runtime of Longformer and Sparse Transformer in our paper. Thanks.
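
To illustrate the "mathematically equivalent" point, here is a sketch (again not the benchmark code, just an assumed toy setup): masking the full score matrix with a large negative bias produces the same output as a genuinely windowed computation that only gathers each query's neighbours; only the cost differs, O(N^2) versus O(N * window).

```python
import jax
import jax.numpy as jnp

def masked_full_attention(q, k, v, window):
    # Quadratic reference: full (seq_len, seq_len) scores, then the window
    # pattern applied as a large negative bias before the softmax.
    idx = jnp.arange(q.shape[0])
    mask = jnp.abs(idx[:, None] - idx[None, :]) <= window
    scores = jnp.einsum('qd,kd->qk', q, k) / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(jnp.where(mask, scores, -1e9), axis=-1)
    return weights @ v

def windowed_attention(q, k, v, window):
    # Sub-quadratic version: each query only gathers its 2*window+1 neighbours,
    # so the cost is O(seq_len * window * head_dim).
    seq_len, dim = q.shape
    offsets = jnp.arange(-window, window + 1)

    def per_query(i, q_i):
        pos = i + offsets
        valid = (pos >= 0) & (pos < seq_len)
        pos = jnp.clip(pos, 0, seq_len - 1)
        scores = (k[pos] @ q_i) / jnp.sqrt(dim)
        weights = jax.nn.softmax(jnp.where(valid, scores, -1e9))
        return weights @ v[pos]

    return jax.vmap(per_query)(jnp.arange(seq_len), q)

key = jax.random.PRNGKey(0)
q, k, v = (jax.random.normal(s, (64, 32)) for s in jax.random.split(key, 3))
print(jnp.allclose(masked_full_attention(q, k, v, 4),
                   windowed_attention(q, k, v, 4), atol=1e-5))  # True
```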