pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Graph Transformer Implementation #627

Open tawe141 opened 5 years ago

tawe141 commented 5 years ago

Hi,

Could you give me some pointers on implementing the Graph Transformer from here: https://openreview.net/pdf?id=HJei-2RcK7 ? Or this one: https://arxiv.org/abs/1905.12712 ?

From what I can see, PyTorch Geometric is designed to work on only binary adjacency matrices (whether an edge exists or not). Is this correct?

rusty1s commented 5 years ago

Not necessarily! Although edge_index is indeed binary (or sparse), you can encode edge weights or edge features in a different tensor of shape [num_edges, *] (commonly referred to as edge_attr or edge_weight) and do something like:

def message(self, x_j, edge_weight):
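    # x_j: source-node features per edge, [num_edges, channels];
    # edge_weight is assumed to broadcast, e.g. shape [num_edges, 1]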
    return x_j * edge_weight

We do this all the time, e.g., with fixed edge weights in GCN, learnable edge weights in GAT, or multi-dimensional edge features in NNConv. I think this distinction helps in understanding that gradients only flow through the edge features, not through the edge indices.
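
For a slightly more complete picture, here is a minimal sketch of such a layer (WeightedConv is a hypothetical name, not something shipped with the library), assuming edge_weight comes in with shape [num_edges]:

import torch
from torch_geometric.nn import MessagePassing

class WeightedConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(WeightedConv, self).__init__(aggr='add')  # sum incoming messages
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index, edge_weight):
        # x: [num_nodes, in_channels], edge_index: [2, num_edges], edge_weight: [num_edges]
        return self.propagate(edge_index, x=self.lin(x), edge_weight=edge_weight)

    def message(self, x_j, edge_weight):
        # Scale each neighbor's transformed features by the weight of the connecting edge.
        return edge_weight.view(-1, 1) * x_j

Since gradients flow through edge_weight (if it requires grad), the weights can also be learnable parameters or the output of another module.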

Your suggested models can surely be implemented within PyTorch Geometric. Feel free to ask any specific questions you might encounter while doing so. The source code of the GATConv might be a good starting point.

tawe141 commented 5 years ago

Thanks. I'm having trouble coming up with a way to do batch-wise intra-graph self-attention the way it's detailed in the Graph Transformer paper. The only way I can think of is looping over the individual graphs and running self-attention on each graph's set of nodes, but that seems very inefficient. Is there a better way to do it?

rusty1s commented 5 years ago

As far as I can see, the intra-graph self-attention operator only updates a target node's features by looking at the features of its neighbors and itself. It is therefore analogous to GATConv, and batching is automatically supported because there are no connections between separate graphs.
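
To make that concrete, a small hypothetical sketch (toy graphs, arbitrary sizes): attention coefficients are computed per edge, so nodes from different graphs in a mini-batch never attend to each other.

import torch
from torch_geometric.data import Data, Batch
from torch_geometric.nn import GATConv

# Two toy graphs with 3 and 2 nodes; batching just concatenates them into one big graph.
data_list = [
    Data(x=torch.randn(3, 16), edge_index=torch.tensor([[0, 1, 2], [1, 2, 0]])),
    Data(x=torch.randn(2, 16), edge_index=torch.tensor([[0, 1], [1, 0]])),
]
batch = Batch.from_data_list(data_list)

conv = GATConv(16, 32, heads=1)
out = conv(batch.x, batch.edge_index)  # [5, 32]: one row per node across both graphs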

In case you are talking about implementing a dense version, I would process the graphs densely using torch_geometric.utils.to_dense_batch and torch_geometric.utils.to_dense_adj, which convert the feature matrices and adjacency matrices to dense versions with an additional batch dimension, padded to the largest number of nodes in the batch. Be aware that you need to mask out the invalid (padded) nodes in your computation with the help of the second return value of to_dense_batch.
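
A rough sketch of what the dense route could look like (toy sizes; plain scaled dot-product attention rather than the paper's exact formulation, so treat everything after the two utility calls as an illustration only):

import torch
from torch_geometric.utils import to_dense_batch, to_dense_adj

# Toy mini-batch: graph 0 has 3 nodes, graph 1 has 2 nodes.
x = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 0, 4, 3]])
batch = torch.tensor([0, 0, 0, 1, 1])  # graph assignment per node

x_dense, mask = to_dense_batch(x, batch)  # [2, 3, 16] features, [2, 3] validity mask
adj = to_dense_adj(edge_index, batch)     # [2, 3, 3] dense adjacency, padded as well

# Single-head dot-product self-attention within each graph, masking padded nodes.
scores = torch.matmul(x_dense, x_dense.transpose(1, 2)) / x_dense.size(-1) ** 0.5
scores = scores.masked_fill(mask.unsqueeze(1) == 0, float('-inf'))  # ignore padded keys
attn = torch.softmax(scores, dim=-1)
out = torch.matmul(attn, x_dense) * mask.unsqueeze(-1).to(x_dense.dtype)  # zero padded rows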

ghost commented 5 years ago

Hi @rusty1s, out of curiosity: why do you suggest processing the graphs densely? Is this performance-related? As far as I know, you could still implement the graph transformer without doing so.

rusty1s commented 5 years ago

This was just speculation: the intra-graph formulation in the Graph Transformer paper iterates over all nodes, not just a node's neighbors. This might simply be imprecise wording in the paper, but in case it really does take non-neighbors into account, processing the dense graph is the way to go.
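
To make the two readings concrete: the only difference lies in which mask gets applied to the dense attention scores. A sketch under the same toy setup as above:

import torch
from torch_geometric.utils import to_dense_batch, to_dense_adj

x = torch.randn(5, 16)
edge_index = torch.tensor([[0, 1, 2, 3, 4], [1, 2, 0, 4, 3]])
batch = torch.tensor([0, 0, 0, 1, 1])

x_dense, mask = to_dense_batch(x, batch)
adj = to_dense_adj(edge_index, batch)
scores = torch.matmul(x_dense, x_dense.transpose(1, 2))

# Reading 1: every real node attends to every other node of the same graph.
scores_all = scores.masked_fill(mask.unsqueeze(1) == 0, float('-inf'))
# Reading 2: nodes only attend to their neighbors (add self-loops beforehand in
# practice, so that no row ends up fully masked before the softmax).
scores_nbr = scores.masked_fill(adj == 0, float('-inf'))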