Closed nabihach closed 2 years ago
Great point. data.adj_t only contains training edge information.
@weihua916 Can you please explain how that is so? I see in lines 221-225 that the edges are not being split when the variable data is created:
dataset = PygLinkPropPredDataset(name='ogbl-collab')
data = dataset[0]
edge_index = data.edge_index
data.edge_weight = data.edge_weight.view(-1).to(torch.float)
data = T.ToSparseTensor()(data)
data by default only contains training edges. The val/test edges can be accessed as:
split_edge = dataset.get_edge_split()
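One way to sanity-check this yourself is to count how many held-out edges also show up in the graph used for message passing. The following is only a minimal sketch on toy data: count_leaked_edges is a hypothetical helper (not part of OGB), and it assumes the message-passing edges are laid out as two rows of endpoints, like the edge_index saved above, and the held-out edges as a list of (u, v) pairs, like split_edge['valid']['edge']. With real tensors you would call .tolist() on them first.

```python
def count_leaked_edges(edge_index, held_out_edges):
    """Count held-out (u, v) pairs that also occur in the message-passing graph.

    edge_index:      two lists [sources, targets], i.e. a [2, E] layout (assumed)
    held_out_edges:  list of [u, v] pairs, i.e. an [N, 2] layout (assumed)
    """
    sources, targets = edge_index
    # Treat edges as undirected by sorting each endpoint pair.
    graph = {tuple(sorted(pair)) for pair in zip(sources, targets)}
    return sum(tuple(sorted(pair)) in graph for pair in held_out_edges)


# Toy stand-ins, not real ogbl-collab data.
train_edge_index = [[0, 1, 1, 2], [1, 0, 2, 1]]              # edges 0-1 and 1-2, both directions
leaky_edge_index = [[0, 1, 1, 2, 0, 2], [1, 0, 2, 1, 2, 0]]  # same graph plus edge 0-2
valid_edges = [[0, 2], [2, 3]]

print(count_leaked_edges(train_edge_index, valid_edges))  # 0
print(count_leaked_edges(leaky_edge_index, valid_edges))  # 1
```

A count of zero is what you would expect if the graph really holds only training edges. If the same pair of nodes can legitimately appear in more than one split (for example, a repeated edge), a nonzero count is not by itself proof of leakage.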
Got it. Thanks!
I'm looking at this line of code: https://github.com/snap-stanford/ogb/blob/c8f0d2aca80a4f885bfd6ad5258ecf1c2d0ac2d9/examples/linkproppred/collab/gnn.py#L107
My question is, why are we training the GNN model on the entire adjacency matrix instead of just the training set's adjacency matrix? If the GNN can see the val and test edges, that information will be incorporated into the GNN's hidden states. Therefore when we send those hidden states to the LinkPredictor to predict presence/absence of edges, wouldn't that be considered data leakage?
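To make the concern concrete, here is a small toy sketch (not the code from gnn.py) of why a held-out edge in the message-passing graph would matter. The aggregate function is a hypothetical one-step sum over neighbors, standing in for a GNN layer, and the features and edges are made up for illustration.

```python
def aggregate(num_nodes, edges, features):
    """One round of sum aggregation over undirected edges (no self-loops)."""
    out = [0.0] * num_nodes
    for u, v in edges:
        out[u] += features[v]
        out[v] += features[u]
    return out


features = [1.0, 10.0, 100.0]    # one scalar feature per node (made up)
train_edges = [(0, 1), (1, 2)]   # edges the model is allowed to see
held_out = (0, 2)                # edge we want to predict

h_train = aggregate(3, train_edges, features)
h_full = aggregate(3, train_edges + [held_out], features)

print(h_train)  # [10.0, 101.0, 10.0]
print(h_full)   # [110.0, 101.0, 11.0]
```

With the held-out edge included, the hidden states of nodes 0 and 2 each absorb the other's feature, so a predictor scoring the pair (0, 2) would be reading information that only exists because the edge was already in the graph. That is the leakage scenario the question describes.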