GNN Explainability Dataset Generation

🚀 The feature, motivation and pitch

Provide support for synthetic datasets commonly used in explainability papers.

This is part of the explainability roadmap (#5520).

Create a high-level API for the functionality described below (each sub-task will be specified in a separate issue).

Synthetic datasets are often useful for explainability. Although they are not always the most accurate benchmark for GNN explainability, they can be used to validate explainability algorithms, to debug models, and to provide ground truth for certain evaluations, such as identifying important subgraph structures.

Following GNNExplainer, the dataset construction has two parts:

A base graph (common choices include grid graphs, Erdős-Rényi (ER) graphs, and Barabási-Albert (BA) graphs).
A set of motifs attached to the base graph at various nodes. Each node's label is defined by whether the node is part of a motif, so that a potentially good explanation for a motif node would just be the motif structure.
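
The two-part construction above can be sketched in plain Python. This is only an illustration of the idea, not the proposed PyG API: the function names `er_graph` and `attach_motifs`, the 5-node "house" motif, and the choice to connect each motif to the base graph with a single bridge edge are all assumptions made for this example.

```python
import random

def er_graph(num_nodes, edge_prob, rng):
    """Return the undirected edge list of an Erdos-Renyi G(n, p) graph."""
    return [(u, v)
            for u in range(num_nodes)
            for v in range(u + 1, num_nodes)
            if rng.random() < edge_prob]

# Hypothetical 5-node "house" motif: a square (0-1-2-3) plus a roof node (4).
HOUSE_SIZE = 5
HOUSE_EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4)]

def attach_motifs(num_base_nodes, base_edges, motif_size, motif_edges,
                  num_motifs, rng):
    """Attach copies of a motif to random base nodes.

    Returns (edges, labels, ground_truth), where labels[i] is 1 if node i
    belongs to a motif and 0 otherwise, and ground_truth lists the motif
    edges that a good explanation would be expected to highlight.
    """
    edges = list(base_edges)
    labels = [0] * num_base_nodes
    ground_truth = []
    for _ in range(num_motifs):
        offset = len(labels)  # index of the first node of the new motif copy
        shifted = [(u + offset, v + offset) for u, v in motif_edges]
        edges.extend(shifted)
        ground_truth.extend(shifted)
        labels.extend([1] * motif_size)
        # Assumption: one bridge edge from a random base node to the motif.
        anchor = rng.randrange(num_base_nodes)
        edges.append((anchor, offset))
    return edges, labels, ground_truth

rng = random.Random(0)
base_edges = er_graph(20, 0.2, rng)
edges, labels, ground_truth = attach_motifs(
    20, base_edges, HOUSE_SIZE, HOUSE_EDGES, num_motifs=3, rng=rng)
```

Here the bridge edges are deliberately left out of `ground_truth`, so the expected explanation for a motif node is the motif alone.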

List of tasks:

[x] Overall framework and API for generating benchmark datasets (and their labels)
[x] Base graph creation with ER graph generator
[x] Base graph creation with BA graph generator
[x] Base graph creation with grid graph generator
[x] Motif generation (different motifs of size 3, 4, 5), and randomly attaching them to the base graph
[ ] GraphWorld dataset (https://github.com/google-research/graphworld)
[x] Infection benchmark (https://dl.acm.org/doi/10.1145/3447548.3467283)

PyG could create a general routine and framework to construct these datasets. For base graphs, aside from those mentioned, it is also worth considering GraphWorld (KDD 2022), which covers a wide range of structural characteristics and is useful for real-world GNN research. For motifs, we can consider the motif atlas (all motifs of size 4, 5, 6, 7, ...). A user can construct a custom dataset by picking a random seed, a base graph generator, and a motif, and then test explainability performance by how well an explainer identifies the selected motif as the important subgraph.
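
One possible way to score "the ability to identify the selected motif" is sketched below. The metric (precision of the top-k scored edges against the ground-truth motif edges, with k equal to the number of motif edges) and the function name `motif_recovery_precision` are assumptions for illustration; the issue does not prescribe a specific metric.

```python
def motif_recovery_precision(edge_scores, motif_edges):
    """Fraction of the top-k scored edges that are ground-truth motif edges.

    edge_scores maps an undirected edge (u, v) to an importance score
    produced by some explainer; k is the number of motif edges.
    """
    truth = {frozenset(edge) for edge in motif_edges}
    k = len(truth)
    if k == 0:
        raise ValueError("motif_edges must not be empty")
    # Rank edges from most to least important and keep the top k.
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)[:k]
    hits = sum(1 for edge in ranked if frozenset(edge) in truth)
    return hits / k

# Toy example: a triangle motif on nodes 0, 1, 2 plus one bridge edge (2, 5).
triangle = [(0, 1), (1, 2), (2, 0)]
scores = {(0, 1): 0.9, (1, 2): 0.8, (2, 5): 0.7, (2, 0): 0.1}
precision = motif_recovery_precision(scores, triangle)
```

In this toy example the explainer ranks the bridge edge above one of the triangle edges, so only two of the top three edges belong to the motif.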

To ensure reproducibility, the dataset generator class will have an option to select a standard seed, one of a number of standard base graph generators, and one of a number of motif patterns. For research reproducibility, one can then simply use the deterministic setting to obtain a standard set of benchmark datasets for explainability.
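
A minimal sketch of the deterministic setting, assuming the generator is driven by a local seeded random number generator and by named registries of base graphs and motifs. The names `BASE_GRAPHS`, `MOTIFS`, and `generate`, and the default sizes, are hypothetical and do not reflect the final PyG interface.

```python
import random

def grid_graph(rng, height=4, width=4):
    """Return (num_nodes, edges) of a 2D grid; rng is unused but kept for a uniform signature."""
    edges = []
    for r in range(height):
        for c in range(width):
            node = r * width + c
            if c + 1 < width:
                edges.append((node, node + 1))
            if r + 1 < height:
                edges.append((node, node + width))
    return height * width, edges

def er_graph(rng, num_nodes=16, edge_prob=0.2):
    """Return (num_nodes, edges) of an Erdos-Renyi G(n, p) graph."""
    edges = [(u, v)
             for u in range(num_nodes)
             for v in range(u + 1, num_nodes)
             if rng.random() < edge_prob]
    return num_nodes, edges

# Hypothetical registries of standard base graph generators and motif patterns.
BASE_GRAPHS = {"grid": grid_graph, "er": er_graph}
MOTIFS = {
    "triangle": (3, [(0, 1), (1, 2), (2, 0)]),
    "cycle4": (4, [(0, 1), (1, 2), (2, 3), (3, 0)]),
    "house": (5, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4)]),
}

def generate(base, motif, num_motifs=3, seed=0):
    """Build a benchmark graph; the same arguments always give the same result."""
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    num_base_nodes, edges = BASE_GRAPHS[base](rng)
    motif_size, motif_edges = MOTIFS[motif]
    labels = [0] * num_base_nodes
    for _ in range(num_motifs):
        offset = len(labels)
        edges += [(u + offset, v + offset) for u, v in motif_edges]
        edges.append((rng.randrange(num_base_nodes), offset))  # bridge edge
        labels += [1] * motif_size
    return edges, labels

# Two calls with the same configuration yield identical datasets.
first = generate("er", "house", seed=7)
second = generate("er", "house", seed=7)
```

Using a local `random.Random(seed)` instead of the module-level functions keeps the result independent of any other code that touches the global random state.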

Alternatives

No response

Additional context

No response

pyg-team / pytorch_geometric