pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

[Question] link_pred.py supervised version? #2096

Open inferense opened 3 years ago

inferense commented 3 years ago

Hey @rusty1s,

Given a dataset with ground truth (graphs with full edge indices), would there need to be any fundamental changes to the architecture to train it in a supervised fashion (besides the loss function and omitting the pos/neg split and negative sampling of the edges)?

Thanks!

rusty1s commented 3 years ago

Not really. You continue to learn node embeddings via GNNs, but you need to lift them to edge features afterwards for the final link prediction:

src, dst = edge_index  # source and destination node indices for each edge
# Concatenate the two endpoint embeddings to obtain an edge embedding:
edge_embedding = torch.cat([node_embedding[src], node_embedding[dst]], dim=-1)
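
To make the supervised variant concrete, here is a minimal sketch of a decoder trained against ground-truth edge labels. Note that hidden_dim, decoder, and edge_label are illustrative names and assumptions, not part of link_pred.py:

import torch
import torch.nn.functional as F

hidden_dim = 64  # arbitrary embedding size, must match the GNN output

# Hypothetical MLP decoder that scores an edge from the concatenated endpoint embeddings:
decoder = torch.nn.Sequential(
    torch.nn.Linear(2 * hidden_dim, hidden_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_dim, 1),
)

logits = decoder(edge_embedding).view(-1)
# edge_label: ground-truth 0/1 label for each column of edge_index.
loss = F.binary_cross_entropy_with_logits(logits, edge_label.float())
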
inferense commented 3 years ago

Does the pos/neg edge split stay as it is? I'm not sure I understand the gist of it in the supervised version. Could you kindly elaborate on the implementation changes?

rusty1s commented 3 years ago

I'm not really sure what you mean. link_pred.py is supervised, as we train against ground-truth edges denoted by pos_edge_index and neg_edge_index (pos_edge_index is a subset of the edges in edge_index). The remaining ones are used for validation and testing.
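
For reference, such a split can be produced along these lines (a sketch using torch_geometric.utils.train_test_split_edges; the attribute names below come from that utility):

from torch_geometric.datasets import Planetoid
from torch_geometric.utils import train_test_split_edges

dataset = Planetoid(root='data/Planetoid', name='Cora')
data = train_test_split_edges(dataset[0])
# data.train_pos_edge_index: positive training edges (a subset of the original edge_index)
# data.val_pos_edge_index / data.val_neg_edge_index: held-out edges for validation
# data.test_pos_edge_index / data.test_neg_edge_index: held-out edges for testing
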

inferense commented 3 years ago

My understanding of link_pred.py (on the Planetoid Cora dataset) is that it learns node embeddings, based on which it adds additional edges with probability > 0, since the final edge index tensor is significantly larger than the original edge index tensor.

(input edge index: torch.Size([2, 10556]), final edge index: torch.Size([2, 3245788]))

My question is about training on a dataset with all edge indices in place and then completing edges when presented with a new graph with missing edges. So the final edge index would ideally be similar in size to the validation graph if it were complete. (Currently it is many times larger, as in the example above.)

Perhaps I'm missing something? I'm new to GNNs, so I appreciate the help!

rusty1s commented 3 years ago

That is actually not what link_pred.py is doing. I'm sorry for the confusion. The related paper can be found here.

This example describes how to tackle a link prediction problem, where we want to find missing links in an incomplete graph. So first, we create this incomplete graph by randomly dropping edges (pos_train_edge_index), from which we want to learn node embeddings suitable for finding the missing links. The final edge_index denotes a probability matrix that describes the probability of edge existence for all node pairs. One can then find potentially missing links by keeping only those links with high probability.
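
As a sketch of that last step (assuming node embeddings z from the trained encoder and an arbitrary 0.9 threshold):

prob_adj = torch.sigmoid(z @ z.t())  # dense [num_nodes, num_nodes] probability matrix
# Keep only node pairs whose predicted probability exceeds the threshold:
candidate_edge_index = (prob_adj > 0.9).nonzero(as_tuple=False).t()  # shape [2, num_candidates]
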

Your problem refers to an inductive link prediction problem, which can be tackled in a similar fashion: you train on an incomplete training graph, and apply the model to unseen incomplete graphs afterwards.

inferense commented 3 years ago

Thank you for the answer. That was my understanding of the current version of link_pred.py; perhaps I didn't phrase my initial question properly.

I was thinking about dropping edges from my training set to do exactly what you've described. But I guess modifying the model to train on incomplete graphs (my training data with edges dropped) against the ground truth (my original training data with full edge indices) would result in better performance, and I was just curious what modifications such a model would need compared to the current link_pred.py.

rusty1s commented 3 years ago

I do not think any modifications are necessary for training the model. For inference, though, you will need to compute embeddings for your test nodes, from which you can then compute link probabilities:

z = model.encode(test_data.x, test_data.edge_index)  # node embeddings for the unseen test graph
# All-pairs inner products (equivalent to z @ z.t()); sigmoid turns the raw scores into probabilities:
test_prob = torch.sigmoid((z.unsqueeze(0) * z.unsqueeze(1)).sum(dim=-1))

where test_prob has shape [num_test_nodes, num_test_nodes].
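
To turn these scores into concrete predicted edges, one option is to rank all node pairs and keep the top k as candidates (a sketch; k = 100 is an arbitrary choice, and self-loops and already-known edges would still need to be filtered out):

k = 100  # arbitrary number of candidate links to keep
values, indices = torch.topk(test_prob.flatten(), k)
row = torch.div(indices, test_prob.size(1), rounding_mode='floor')
col = indices % test_prob.size(1)
pred_edge_index = torch.stack([row, col], dim=0)  # shape [2, k]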