pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Not all LINKX datasets are available #4569

Open OlegPlatonov opened 2 years ago

OlegPlatonov commented 2 years ago

🚀 The feature, motivation and pitch

Hi! I've noticed that PyG now has datasets from the “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” paper. However, for some reason, not all of the datasets proposed in the paper are provided. It would be great if the remaining datasets from the paper (pokec, genius, wiki, etc.) were added. There are not many heterophilous graph datasets, and it would be very useful to have all of them in one place.

Alternatives

No response

Additional context

No response

Padarn commented 2 years ago

Hey @OlegPlatonov I've opened a PR here to address this https://github.com/pyg-team/pytorch_geometric/pull/4570. Please take a look and let me know what you think. Happy to have your input especially on the features for the deezer-europe dataset.

OlegPlatonov commented 2 years ago

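Hi @Padarn! I'm afraid there is a slight misunderstanding. The repository for the “Large Scale Learning on Non-Homophilous Graphs: New Benchmarks and Strong Simple Methods” paper (https://arxiv.org/abs/2110.14446) stores some datasets from prior works, and these are the datasets you have added. However, they are already present in PyG (for example, WikipediaNetwork: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.WikipediaNetwork and DeezerEurope: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html#torch_geometric.datasets.DeezerEurope). What I meant were the new datasets introduced in the paper, specifically pokec, arXiv-year, snap-patents, genius, twitch-gamers and wiki.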

Padarn commented 2 years ago

Oh I see, thanks for the clarification! I didn't look carefully enough at what already exists. I'll update tomorrow.


Padarn commented 2 years ago

Hey @OlegPlatonov - actually, many of the datasets are already in PyG. For example, 'twitch-gamers' is available here.

However, in the paper they say they have updated some of these datasets:

Most of these datasets have been used for evaluation of graph machine learning models in past work; we make adjustments such as modifying node labels and adding node features that allow for evaluation of GNNs in non-homophilous settings. We define node features for Pokec, genius, and snap-patents, and we also define node labels for arXiv-year, snap-patents, and genius. Additionally, we crawl and clean the large-scale wiki dataset — a new Wikipedia dataset where the task is to predict page views, which is non-homophilous with respect to the graph of articles connected by links between articles (see Appendix D.3). This wiki dataset has 1,925,342 nodes and 303,434,860 edges, so training and inference require scalable algorithms.

The only one totally new is the wiki dataset as far as I can tell.

I've updated the PR to just add 'genius', which did seem to be missing before.

OlegPlatonov commented 2 years ago

Hey @Padarn - indeed, most datasets from the paper are not entirely new, but unless I'm missing something, they are not available in PyG (at least not in the form used in the paper). I've just checked and could not find the arXiv-year, snap-patents, genius and wiki datasets in PyG. The pokec dataset is available here, but it does not contain the node features that were defined in the paper. As for the twitch dataset, the version in PyG is a collection of 6 different graphs, which is different from the single twitch-gamers graph used in the paper (the number of nodes does not match).
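For anyone who wants to repeat this kind of check locally, here is a minimal sketch of comparing the paper's dataset names against what an installed PyG exposes. The helper names are hypothetical, and it assumes that `LINKXDataset` keeps its supported names in a class-level `datasets` mapping, which may not hold for every PyG version. It only compares names, so it cannot tell whether a dataset with a matching name has the same features, labels or node count as the paper's version.

```python
# Names of the datasets introduced in the paper, as listed in this thread.
PAPER_DATASETS = [
    "pokec", "arXiv-year", "snap-patents", "genius", "twitch-gamers", "wiki",
]


def missing_datasets(requested, available):
    """Return the requested names that are absent from `available`.

    The comparison is case-insensitive and preserves the order of `requested`.
    """
    available_lower = {name.lower() for name in available}
    return [name for name in requested if name.lower() not in available_lower]


def linkx_names_in_installed_pyg():
    """Best-effort list of names the installed LINKXDataset accepts.

    Assumption: LINKXDataset stores its supported names in a class-level
    `datasets` mapping. If PyG cannot be imported or the attribute is
    absent, an empty list is returned instead of raising.
    """
    try:
        from torch_geometric.datasets import LINKXDataset
    except Exception:
        return []
    return list(getattr(LINKXDataset, "datasets", {}))
```

Calling `missing_datasets(PAPER_DATASETS, linkx_names_in_installed_pyg())` lists the paper datasets that `LINKXDataset` does not offer by name in the local install.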

Padarn commented 2 years ago

Yeah, understood. They're not all there and some are updated in the paper; it just wasn't immediately clear what the best thing to do was with updated versions of datasets that we already have.

Maybe it's easier to tackle them separately across a few PRs? I have one open for genius now, maybe we could prioritize the others?

rusty1s commented 2 years ago

Thanks @Padarn for your work on adding some of these datasets. I think adding the remaining ones is definitely of interest to the community, especially in order to accelerate GNN research on heterophilous graphs. Let's try to tackle this in follow-up PRs.

Padarn commented 2 years ago

Yep aligned. I think adding wiki from the paper is the highest priority.