snap-stanford / ogb

Benchmark datasets, data loaders, and evaluators for graph machine learning
https://ogb.stanford.edu
MIT License

Problem of using undirected graph in ogbn-arxiv, ogbn-papers100m and ogbn-mag #58

Closed HuXiangkun closed 4 years ago

HuXiangkun commented 4 years ago

Hi guys,

I see that the code on the leaderboard uses undirected graphs for ogbn-arxiv, ogbn-papers100m and ogbn-mag, and I have a question about it.

The nodes in the three datasets are split by time (year of publication). However, using an undirected graph may cause data leakage, because we cannot predict the properties of older papers using newer papers. So simply adding reverse edges to the full graph is not reasonable.

One possible way of handling this is to let papers from a given year see only papers from that year and earlier years. Is that correct?
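A minimal sketch of that idea, using a toy graph rather than the actual OGB data. The function name `temporal_message_edges` and the toy inputs are hypothetical, and it assumes citations are given as `(src, dst)` pairs with a publication year available per node. It adds reverse edges, then drops any edge whose sender is newer than its receiver:

```python
def temporal_message_edges(citations, node_year):
    """Keep only (sender, receiver) pairs whose sender is not newer than the receiver.

    citations: iterable of (src, dst) pairs, meaning paper src cites paper dst.
    node_year: sequence where node_year[i] is the publication year of paper i.
    """
    # Symmetrize: each citation (src, dst) also yields the reverse pair (dst, src).
    candidates = set()
    for src, dst in citations:
        candidates.add((src, dst))
        candidates.add((dst, src))
    # A paper may receive messages only from papers of the same or an earlier year.
    return sorted((s, r) for s, r in candidates if node_year[s] <= node_year[r])


# Toy example (made-up data): four papers, where newer papers cite older ones.
years = [2015, 2016, 2016, 2018]
citations = [(1, 0), (2, 1), (3, 2)]
message_edges = temporal_message_edges(citations, years)
```

In this toy graph, paper 0 (2015) receives no messages at all, because its only neighbor was published later. Papers 1 and 2 (both 2016) can still exchange messages in both directions. For graphs of realistic size, the same filter would normally be done with vectorized array operations instead of Python loops.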

Many thanks.

weihua916 commented 4 years ago

Hi, this is a great point! Strictly speaking, you are right: we cannot use features of future papers to predict labels. In OGB, we do not impose that constraint, for the sake of simplicity. At least for test nodes (which correspond to the newest papers), labels are predicted based only on papers from the same year and previous years. At training time, researchers should feel free to exploit the temporal information so that their models generalize well at test time, when models can only see past papers. Hope this answers your question.

HuXiangkun commented 4 years ago

Understood, thank you for the reply!