qitianwu / NodeFormer

The official implementation of NeurIPS22 spotlight paper "NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification"

What is the principle of exchanging the first two dimensions when calculating QKV attention? #10

Closed WithMeteor closed 1 year ago

WithMeteor commented 1 year ago

When reading the source code of NodeFormer, I noticed that when computing QKV attention, the first and second dimensions of query/key/value are exchanged, e.g. in lines 169-171 of nodeformer.py. After the attention is computed, the first two dimensions are exchanged back before normalization. At first I thought this step was unnecessary, until I commented the code out and ran into an out-of-memory error. So I am curious about the principle behind this step. Does keeping the node_number in the second dimension affect the complexity of the matrix multiplication when computing the dot product of key and value, which is why the node_number is moved to the first dimension in advance?
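For context, here is a minimal sketch of Performer-style kernelized (linear) attention with the node axis moved to the front, which is the pattern the transposed layout enables. The function and variable names are illustrative, not lines from nodeformer.py; the key point is that the einsum contractions sum over the node dimension `n` without ever materializing an N x N attention matrix.

```python
import torch

def kernelized_attention(q_prime, k_prime, value):
    # q_prime, k_prime: [N, B, H, M]  (node axis moved to the front)
    # value:            [N, B, H, D]
    # Contract over nodes once to get a compact [B, H, M, D] summary,
    # so the cost stays O(N * M * D) instead of O(N^2).
    kv = torch.einsum('nbhm,nbhd->bhmd', k_prime, value)
    numerator = torch.einsum('nbhm,bhmd->nbhd', q_prime, kv)
    # Normalizer: for each query node, the sum of kernel scores over all keys.
    k_sum = k_prime.sum(dim=0)                                    # [B, H, M]
    denominator = torch.einsum('nbhm,bhm->nbh', q_prime, k_sum)   # [N, B, H]
    return numerator / denominator.unsqueeze(-1)

# Illustrative shapes (hypothetical): N nodes, batch B, H heads, M features, D dims.
N, B, H, M, D = 1000, 1, 4, 30, 16
q = torch.rand(B, N, H, M); k = torch.rand(B, N, H, M); v = torch.rand(B, N, H, D)
# The transpose puts the node axis first so the einsums above contract over 'n'.
out = kernelized_attention(q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1))
print(out.shape)  # torch.Size([1000, 1, 4, 16])
```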

WithMeteor commented 1 year ago

I think I have found the cause of the problem. When computing the weights of the adjacency matrix, the slicing dimension needs to be adjusted when obtaining query_end and key_start: change query_prime[end] to query_prime[:, end] and key_prime[start] to key_prime[:, start] in lines 143 and 188, and change attn_normalizer[end] to attn_normalizer[:, end] in lines 147 and 192. This solves the problem, so this issue will be closed.
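As a small illustration of why the slice must follow the node axis (names are hypothetical, not the repository's code): tensor[end] indexes the first dimension, while tensor[:, end] indexes the second, so which form is correct depends on whether the node axis is still in front after the transpose.

```python
import torch

# Hypothetical shapes with the node axis in front, as after the transpose.
N, B, H, M = 6, 1, 2, 4
query_prime = torch.rand(N, B, H, M)   # [N, B, H, M]

# Edge list: end (target) node indices for which per-edge features are gathered.
end = torch.tensor([1, 3, 5])

# If the node axis is the second axis (layout [B, N, H, M], i.e. transpose removed),
# per-edge features must be gathered with tensor[:, end].
query_bnhm = query_prime.transpose(0, 1)      # [B, N, H, M]
q_end_a = query_bnhm[:, end]                  # [B, E, H, M]

# With the node axis in front, plain tensor[end] picks the same nodes.
q_end_b = query_prime[end].transpose(0, 1)    # [B, E, H, M]
print(torch.allclose(q_end_a, q_end_b))       # True
```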