qitianwu / NodeFormer

The official implementation of NeurIPS22 spotlight paper "NodeFormer: A Scalable Graph Structure Learning Transformer for Node Classification"

What is the principle of exchanging the first two dimensions when calculating QKV attention? #10

Closed WithMeteor closed 1 year ago

WithMeteor commented 1 year ago

When reading the source code of NodeFormer, I noticed that when computing QKV attention, the first and second dimensions of query/key/value are exchanged, e.g. in lines 169-171 of nodeformer.py. After the attention is computed, the first two dimensions are exchanged back before normalization. At first I thought this step was unnecessary, until I commented the code out and ran into an out-of-memory error. So I am curious about the principle behind this step. Does keeping the node_number in the second dimension affect the complexity of the matrix multiplication when computing the dot product of key and value, which is why the node_number is moved to the first dimension in advance?
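For context, here is a minimal sketch of Performer-style kernelized (linear) attention with the node axis moved to the front, which is the pattern the transposed layout enables. The function and variable names are illustrative, not lines from nodeformer.py; the key point is that the einsum contractions sum over the node dimension `n` without ever materializing an N x N attention matrix.

```python
import torch

def kernelized_attention(q_prime, k_prime, value):
    # q_prime, k_prime: [N, B, H, M]  (node axis moved to the front)
    # value:            [N, B, H, D]
    # Contract over nodes once to get a compact [B, H, M, D] summary,
    # so the cost stays O(N * M * D) instead of O(N^2).
    kv = torch.einsum('nbhm,nbhd->bhmd', k_prime, value)
    numerator = torch.einsum('nbhm,bhmd->nbhd', q_prime, kv)
    # Normalizer: for each query node, the sum of kernel scores over all keys.
    k_sum = k_prime.sum(dim=0)                                    # [B, H, M]
    denominator = torch.einsum('nbhm,bhm->nbh', q_prime, k_sum)   # [N, B, H]
    return numerator / denominator.unsqueeze(-1)

# Illustrative shapes (hypothetical): N nodes, batch B, H heads, M features, D dims.
N, B, H, M, D = 1000, 1, 4, 30, 16
q = torch.rand(B, N, H, M); k = torch.rand(B, N, H, M); v = torch.rand(B, N, H, D)
# The transpose puts the node axis first so the einsums above contract over 'n'.
out = kernelized_attention(q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1))
print(out.shape)  # torch.Size([1000, 1, 4, 16])
```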

WithMeteor commented 1 year ago

I think I have found the cause of the problem. When computing the weights of the adjacency matrix, the slicing dimension needs to be adjusted when obtaining query_end and key_start: change query_prime[end] to query_prime[:, end] and key_prime[start] to key_prime[:, start] in lines 143 and 188, and change attn_normalizer[end] to attn_normalizer[:, end] in lines 147 and 192. This solves the problem, so this issue will be closed.
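As a small illustration of why the slice must follow the node axis (names are hypothetical, not the repository's code): tensor[end] indexes the first dimension, while tensor[:, end] indexes the second, so which form is correct depends on whether the node axis is still in front after the transpose.

```python
import torch

# Hypothetical shapes with the node axis in front, as after the transpose.
N, B, H, M = 6, 1, 2, 4
query_prime = torch.rand(N, B, H, M)   # [N, B, H, M]

# Edge list: end (target) node indices for which per-edge features are gathered.
end = torch.tensor([1, 3, 5])

# If the node axis is the second axis (layout [B, N, H, M], i.e. transpose removed),
# per-edge features must be gathered with tensor[:, end].
query_bnhm = query_prime.transpose(0, 1)      # [B, N, H, M]
q_end_a = query_bnhm[:, end]                  # [B, E, H, M]

# With the node axis in front, plain tensor[end] picks the same nodes.
q_end_b = query_prime[end].transpose(0, 1)    # [B, E, H, M]
print(torch.allclose(q_end_a, q_end_b))       # True
```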