graphdeeplearning / graphtransformer

Graph Transformer Architecture. Source code for "A Generalization of Transformer Networks to Graphs", DLG-AAAI'21.
https://arxiv.org/abs/2012.09699
MIT License

Memory consumption #17

Open GregorKobsik opened 2 years ago

GregorKobsik commented 2 years ago

Could you provide some additional information about the memory consumption of your Graph Transformer?

You state that sparse attention benefits both computation time and memory consumption, but your evaluation does not provide actual measurements of the latter, nor does it state clearly whether and how your implementation takes advantage of it. Peak-memory measurements from your experiments, as an addendum to the evaluation of computation times (e.g. Table 1), would be beneficial to others, too. In my case, the quadratic growth of memory consumption w.r.t. the sequence length prevents an efficient use of Transformers for some tasks where connectivity information is given and can simply be modeled by masking out (-Inf) the attention scores in the attention matrix.

Some exemplary or artificial data could also be interesting, e.g. (mean) number of nodes n = {128, 1024, 2048, 4096} and (mean) number of edges per node e = {4, 8, 16, 64, 128}, to get an impression of the resource consumption of your Graph Transformer (sparse graph) vs. an NLP Transformer (full graph with masking); see the sketch below.
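
For reference, here is a rough single-head sketch of the comparison I have in mind (pure PyTorch, not your DGL implementation; function names, the random graph, and the sizes are all illustrative): a dense variant that materializes the full N x N score matrix and masks non-edges with -Inf, versus an edge-wise variant that stores only one score per edge, with peak GPU memory read via torch.cuda.max_memory_allocated().

```python
import torch

def dense_masked_attention(Q, K, V, adj_mask):
    # Full N x N score matrix; memory grows as O(N^2) regardless of graph sparsity.
    scores = (Q @ K.transpose(-1, -2)) / Q.shape[-1] ** 0.5
    scores = scores.masked_fill(~adj_mask, float('-inf'))
    return torch.softmax(scores, dim=-1) @ V

def sparse_edge_attention(Q, K, V, src, dst):
    # One score per edge; memory grows as O(E).
    scores = (Q[dst] * K[src]).sum(-1) / Q.shape[-1] ** 0.5
    scores = scores - scores.max()  # crude global shift for numerical stability
    w = scores.exp()
    # Per-destination softmax normalization and weighted aggregation of source values.
    denom = torch.zeros(Q.shape[0], device=Q.device).scatter_add_(0, dst, w)
    out = torch.zeros_like(V).index_add_(0, dst, (w / denom[dst]).unsqueeze(-1) * V[src])
    return out

def peak_memory_mb(fn, *args):
    torch.cuda.reset_peak_memory_stats()
    fn(*args)
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

n, e_per_node, d = 4096, 8, 64
device = 'cuda'
Q, K, V = (torch.randn(n, d, device=device) for _ in range(3))
# Random graph with e_per_node incoming edges per node.
dst = torch.arange(n, device=device).repeat_interleave(e_per_node)
src = torch.randint(0, n, (n * e_per_node,), device=device)
adj_mask = torch.zeros(n, n, dtype=torch.bool, device=device)
adj_mask[dst, src] = True

print('dense masked attention:', peak_memory_mb(dense_masked_attention, Q, K, V, adj_mask), 'MB')
print('edge-wise attention   :', peak_memory_mb(sparse_edge_attention, Q, K, V, src, dst), 'MB')
```

The dense variant allocates the N x N score and mask tensors no matter how sparse the graph is, while the edge-wise variant only allocates tensors of length E = n * e_per_node; that is exactly the difference I would like to see quantified on your benchmark graphs.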

(I could, of course, run the experiments myself, but I suppose your evaluation pipeline is already in place, and data provided by the original authors would also be more precise and trustworthy to other researchers.)

vijaydwivedi75 commented 2 years ago

Hi @GregorKobsik,

When we use the sparse Graph Transformer (i.e. attention over the original sparse adjacency matrix), the memory consumption is O(E). With the fully-connected graph, the memory consumption becomes O(N^2), where N and E denote the number of nodes and edges, respectively. Since these orders of memory consumption follow directly from the formulation, we did not report the actual memory consumed during training/evaluation.

In the case of real-world (sparse) graphs, E is significantly smaller than N^2, hence the memory consumed by a Graph Transformer operating on the sparse graph is much smaller than by one operating on the fully-connected graph (like the NLP Transformer).
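
As a rough illustrative calculation (numbers chosen for illustration, not taken from our benchmarks): for N = 4096 nodes with an average degree of 8, the sparse graph has E = 4096 × 8 = 32,768 attention scores to store, versus N^2 = 4096^2 ≈ 16.8M for the fully-connected graph, i.e. about 512× fewer entries.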