In your paper you mention that "It is known that self-attention is not easily parallelizable on GPUs" (Window Size for Local Self-Attention in Ablation Experiments). I tried looking for works/sources that would mention this as an issue as well, but I couldn't find any. Could you maybe explain or provide sources for why self-attention is not easily parallelizable?
The text here was intended to refer to the difficulty of implementing self-attention efficiently on GPUs: the computation is mostly memory-IO bound rather than compute bound. See these papers for some discussion: [link1][link2].
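To illustrate the memory-IO issue, here is a minimal NumPy sketch (my own, not from the paper) of the standard attention computation. The point is that the naive formulation materializes the full N x N score matrix as an intermediate, so for long sequences the kernel's time is dominated by reading and writing that matrix to GPU memory rather than by arithmetic:

```python
import numpy as np

def naive_attention(Q, K, V):
    # Naive self-attention: the full (N, N) score matrix is
    # materialized in memory. On a GPU, moving this intermediate
    # to and from memory dominates runtime for large N, which is
    # what "memory-IO bound" refers to.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (N, N) intermediate
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                              # (N, d) output

# The (N, N) intermediate grows quadratically with sequence length N,
# while the inputs/outputs only grow linearly.
N, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out = naive_attention(Q, K, V)
print(out.shape)  # (1024, 64)
```

Approaches like fused/tiled attention kernels avoid materializing this intermediate, which is the subject of the linked papers.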