lucidrains / h-transformer-1d

Implementation of H-Transformer-1D, Hierarchical Attention for Sequence Learning
MIT License

Does h-transformer-1d have a subquadratic cost? #7

Closed jinmang2 closed 3 years ago

jinmang2 commented 3 years ago

In the h-transformer-1d paper, the Google researchers say:

1. Introduction
...
In this paper, we draw inspiration from two branches in numerical analysis:
Hierarchical Matrices (H-Matrix) and the Multigrid method.
We propose a hierarchical attention that has linear complexity in run time and memory, ...

In section 6.2, they explain in depth why h-transformer-1d has linear complexity!

My question is: why is it described as a subquadratic cost? Is that different from linear complexity?

In the Linformer paper, they also propose a linear-complexity transformer model. However, in some sources (e.g. https://andre-martins.github.io/docs/dsl2020/attention-mechanisms.pdf, slide 107), attention mechanisms proposed to solve the quadratic bottleneck are often called subquadratic self-attention (e.g. BigBird, Linformer, Linear Transformer, Performer).

What is the difference between subquadratic cost and linear complexity in time and space? Can I understand Luna as also having a subquadratic cost?

lucidrains commented 3 years ago

they are essentially the same thing here! linear complexity refers to O(N), while subquadratic refers to any complexity that is less than O(N^2), which O(N) satisfies (as do the other families of efficient attention)
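
To make the distinction concrete, here is a minimal sketch (not from the thread, just an illustration) that prints how the different complexity classes grow with sequence length N. Both O(N) and O(N log N) grow more slowly than O(N^2), which is why they are all lumped under "subquadratic"; the labels in the comments are illustrative, not exact cost models of any particular architecture.

```python
# Rough comparison of attention cost scaling with sequence length N.
# O(N^2): full softmax attention; O(N log N) and O(N): typical
# "efficient attention" regimes (e.g. hierarchical or low-rank schemes).
import math

for n in (1024, 4096, 16384):
    quadratic  = n ** 2                 # full attention score matrix
    log_linear = n * math.log2(n)       # e.g. some sparse/hierarchical schemes
    linear     = n                      # e.g. linear-complexity attention
    print(f"N = {n:6d} | O(N^2) = {quadratic:12,d} | "
          f"O(N log N) = {log_linear:12,.0f} | O(N) = {linear:6d}")
```

For N = 16384, the quadratic term is already ~268 million entries versus ~229 thousand for N log N and 16 thousand for N, so any scheme in the latter two classes counts as subquadratic, and linear complexity is simply the best case within that family.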