Clarification about #edges in ogbl-collab

snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning

https://ogb.stanford.edu


Clarification about #edges in ogbl-collab #87

Closed vymao closed 3 years ago

vymao commented 3 years ago

Hi,

Just looking for some clarification about the link prediction datasets. Following the example given on the website for ogbl-collab, when I run dataset[0] I get 2358104 edges. This is more than what is stated in the description, yet these are said to be the training edges (and the count doesn't match train_edge['edge'] either). Can you explain this?

weihua916 commented 3 years ago

Hi! What is on the website is the number of edges in train+validation+test sets (see the description of Table 3 in our paper).

vymao commented 3 years ago

I see, but then what is dataset[0] supposed to represent? Why are there more edges here (2358104) than in the test + training + validation?

>>> train_edge['edge'].size()
torch.Size([1179052, 2])
>>> valid_edge['edge'].size()
torch.Size([60084, 2])
>>> test_edge['edge'].size()
torch.Size([46329, 2])
>>> dataset[0].edge_index.size()
torch.Size([2, 2358104])

Does dataset[0] include each edge twice (with alternating start/end nodes) since it is an undirected graph?
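The counts in the session above can be checked against this guess with plain arithmetic. The snippet below is only a sanity check on the numbers quoted in this thread; the variable names are made up for illustration and nothing here calls the OGB library.

```python
# Edge counts copied from the interactive session quoted above.
num_train = 1179052       # rows in train_edge['edge']
num_valid = 60084         # rows in valid_edge['edge']
num_test = 46329          # rows in test_edge['edge']
num_edge_index = 2358104  # columns in dataset[0].edge_index

# Total over the three splits.
total_splits = num_train + num_valid + num_test
print(total_splits)  # 1285465

# edge_index holds exactly twice the number of training edges,
# which is what storing each edge in both directions would produce.
print(num_edge_index == 2 * num_train)  # True
```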

weihua916 commented 3 years ago

Yes, that's correct. For all undirected graphs, we represent them as graphs with bidirectional edges. See Note here.
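To illustrate the bidirectional representation, here is a minimal sketch on a toy graph. It uses plain Python lists rather than tensors, and the helper names `to_bidirectional` and `count_undirected` are hypothetical; this is not how OGB builds `edge_index` internally, only a small example of the same idea.

```python
def to_bidirectional(edges):
    """Store every undirected edge (u, v) as both u->v and v->u.

    Returns [sources, targets], mirroring the 2 x num_edges layout
    of an edge_index array.
    """
    src = [u for u, v in edges] + [v for u, v in edges]
    dst = [v for u, v in edges] + [u for u, v in edges]
    return [src, dst]


def count_undirected(edge_index):
    """Count distinct undirected pairs by ignoring edge direction."""
    src, dst = edge_index
    return len({(min(u, v), max(u, v)) for u, v in zip(src, dst)})


# Toy undirected graph with three edges (an assumption for illustration).
toy_edges = [(0, 1), (1, 2), (0, 3)]
edge_index = to_bidirectional(toy_edges)

print(len(edge_index[0]))            # 6: each edge appears twice
print(count_undirected(edge_index))  # 3: the original undirected edges
```

Note that the set-based count also collapses repeated pairs, so on a dataset that stores the same pair more than once it may come out lower than the number of rows in a split.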