Question regarding the link prediction datasets

devnkong commented 3 years ago

Hi, I'm a bit confused about the relationship between the training edge set and the graph adjacency we load for link tasks.

My understanding is that the two are disjoint sets: the adjacency is used for message passing, while the training edges are used to make predictions and compute the loss. I have no trouble with any of the full-batch examples, but I'm confused by the code below: https://github.com/snap-stanford/ogb/blob/048ef636fda3d0c4a25b108651bf1d43050fa7ae/examples/linkproppred/citation2/cluster_gcn.py#L93

It seems that you compute the loss directly on the edges of the adjacency used for message passing (the train function does not use split_edge['train'] at all). Is this meant as a stand-in for the training edges, or are the adjacency and the training edges identical? (I checked the shapes of the two, and they do not appear to be identical.)

Looking forward to your reply!

weihua916 commented 3 years ago

Hi! Great point. data.edge_index contains the same edges as split_edge['train']; the two are not disjoint. (data.edge_index is made undirected to allow bi-directional message passing for undirected graph datasets, which may be why the shapes differ.) In our example code, we do perform message passing over data.edge_index and try to predict split_edge['train'].
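
To illustrate the point about shapes, here is a minimal sketch on a toy graph. It uses plain NumPy rather than the OGB loader, and train_edges and the local to_undirected helper are hypothetical stand-ins for split_edge['train'] and data.edge_index. It assumes the training edges are stored as [num_edges, 2] and the message-passing edges as [2, num_edges].

```python
import numpy as np

# Hypothetical toy training edges, one (source, target) pair per row.
train_edges = np.array([[0, 1], [1, 2], [2, 3]])


def to_undirected(edges):
    """Add the reverse of every edge, drop duplicates, return shape [2, num_edges]."""
    both_directions = np.concatenate([edges, edges[:, ::-1]], axis=0)
    both_directions = np.unique(both_directions, axis=0)
    return both_directions.T


# Stand-in for the undirected edge index used for message passing.
edge_index = to_undirected(train_edges)

print(train_edges.shape)  # (3, 2)
print(edge_index.shape)   # (2, 6)

# Every training edge is also present in the message-passing graph.
train_set = {tuple(e) for e in train_edges.tolist()}
graph_set = {tuple(e) for e in edge_index.T.tolist()}
print(train_set <= graph_set)  # True
```

In this toy case the two arrays have different shapes even though every training edge also appears among the message-passing edges.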

If you want to perform message passing over a subset of training edges, you need to explicitly write that code.
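
One possible way to write that code is sketched below. This is only an illustration, not part of the OGB examples: split_message_and_supervision, supervision_fraction, and the toy edge list are all hypothetical names, and it again assumes training edges stored as a [num_edges, 2] NumPy array with no duplicate or reversed pairs.

```python
import numpy as np


def split_message_and_supervision(train_edges, supervision_fraction=0.2, seed=0):
    """Randomly split training edges into message-passing and supervision edges.

    Returns an undirected [2, num_msg_edges * 2] edge index for message passing
    and a [num_supervision_edges, 2] array of held-out edges for the loss.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(train_edges))
    num_supervision = int(len(train_edges) * supervision_fraction)

    supervision_edges = train_edges[perm[:num_supervision]]
    message_edges = train_edges[perm[num_supervision:]]

    # Add reverse edges so messages flow in both directions.
    msg_edge_index = np.concatenate(
        [message_edges, message_edges[:, ::-1]], axis=0
    ).T
    return msg_edge_index, supervision_edges


# Hypothetical toy path graph: 0-1, 1-2, ..., 9-10.
train_edges = np.array([[i, i + 1] for i in range(10)])
msg_edge_index, supervision_edges = split_message_and_supervision(train_edges)

print(msg_edge_index.shape)     # (2, 16)
print(supervision_edges.shape)  # (2, 2)
```

With a split like this, the model would propagate over msg_edge_index and compute the loss only on supervision_edges, so the edges being predicted are not visible during message passing.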

devnkong commented 3 years ago

Thanks for the prompt reply! That's crystal clear!

snap-stanford / ogb

Question regarding the link prediction datasets #213