snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Training GNN on entire adj instead of training_set_adj #316

Closed: nabihach closed this issue 2 years ago

nabihach commented 2 years ago

I'm looking at this line of code: https://github.com/snap-stanford/ogb/blob/c8f0d2aca80a4f885bfd6ad5258ecf1c2d0ac2d9/examples/linkproppred/collab/gnn.py#L107

My question is, why are we training the GNN model on the entire adjacency matrix instead of just the training set's adjacency matrix? If the GNN can see the val and test edges, that information will be incorporated into the GNN's hidden states. Therefore when we send those hidden states to the LinkPredictor to predict presence/absence of edges, wouldn't that be considered data leakage?

weihua916 commented 2 years ago

Great point. data.adj_t only contains training edge information.

nabihach commented 2 years ago

@weihua916 Can you please explain how that is so? In lines 221-225, I see that the edges are not split when the variable data is created:

    dataset = PygLinkPropPredDataset(name='ogbl-collab')
    data = dataset[0]
    edge_index = data.edge_index
    data.edge_weight = data.edge_weight.view(-1).to(torch.float)
    data = T.ToSparseTensor()(data)

weihua916 commented 2 years ago

By default, data contains only the training edges. The validation and test edges can be accessed with:

    split_edge = dataset.get_edge_split()

nabihach commented 2 years ago

Got it. Thanks!
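
The leakage concern raised in this thread can be checked directly. The following is a minimal sketch, not code from the ogb repository: the helper names (to_undirected_set, count_leaked_edges) and the toy edge lists are hypothetical stand-ins. It counts how many held-out edges also appear among the edges used for message passing, treating edges as undirected. With the real dataset, one would presumably pass the edge pairs from data.edge_index and from split_edge['valid']['edge'] or split_edge['test']['edge'], converted to Python lists; that usage is an assumption and is not run here.

```python
def to_undirected_set(edges):
    """Canonicalize (u, v) pairs so that (u, v) and (v, u) compare equal."""
    return {(min(u, v), max(u, v)) for u, v in edges}


def count_leaked_edges(message_passing_edges, heldout_edges):
    """Count held-out edges that are visible in the message-passing graph."""
    seen = to_undirected_set(message_passing_edges)
    return sum(1 for u, v in heldout_edges if (min(u, v), max(u, v)) in seen)


# Toy stand-ins for the three splits (hypothetical data, not ogbl-collab).
train_edges = [(0, 1), (1, 0), (1, 2), (2, 1)]  # both directions stored
valid_edges = [(0, 2)]
test_edges = [(2, 3)]
heldout = valid_edges + test_edges

# Case 1: the GNN only sees training edges, as described in this thread.
leak_free = count_leaked_edges(train_edges, heldout)

# Case 2: the GNN sees every edge, which is the scenario the question worried about.
full_graph = train_edges + valid_edges + test_edges
leaky = count_leaked_edges(full_graph, heldout)

print(leak_free, leaky)
```

A count of zero is consistent with the adjacency containing training edges only. On a real dataset a nonzero count would need a closer look before being called leakage, since the same node pair could legitimately occur in more than one split.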