Closed RexYing closed 1 year ago
Here is a draft idea regarding a potential architecture for this issue:
# Interface for the user
generator = GraphWorld(...)
motif = MotifGenerator(...)
dataset = ExplainerDataset(generator=generator, motif=motif, seed=...)
class GraphWorld(GraphGenerator):
- process
class BAGraph(GraphGenerator):
- process
class GraphGenerator():
- generate labels
# Generate motif based on the chosen structure
class MotifGenerator():
class ExplainerDataset(InMemoryDataset): # data/synthetic_dataset.py
- attach motif
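To make the draft above more concrete, here is a minimal, dependency-free sketch of how the pieces might fit together. It is only an illustration under stated assumptions, not the eventual PyG implementation: the `generate(rng)` method, the `(num_nodes, edges)` return format, the structure names `"house"` and `"cycle"`, and the helper `build_explainer_graph` are all hypothetical, and plain Python lists stand in for tensors.

```python
import random
from abc import ABC, abstractmethod


class GraphGenerator(ABC):
    """Base class for base-graph generators (class name from the draft)."""

    @abstractmethod
    def generate(self, rng):
        """Return (num_nodes, edges), with edges as a list of (u, v) pairs."""


class BAGraph(GraphGenerator):
    """Barabasi-Albert graph built by preferential attachment."""

    def __init__(self, num_nodes, num_edges):
        self.num_nodes = num_nodes
        self.num_edges = num_edges  # edges added per new node

    def generate(self, rng):
        m = self.num_edges
        edges = []
        targets = list(range(m))  # the first new node links to nodes 0..m-1
        repeated = []             # each node appears once per incident edge
        for new in range(m, self.num_nodes):
            edges.extend((new, t) for t in targets)
            repeated.extend(targets)
            repeated.extend([new] * m)
            # Draw m distinct targets with probability proportional to degree.
            chosen = set()
            while len(chosen) < m:
                chosen.add(rng.choice(repeated))
            targets = sorted(chosen)
        return self.num_nodes, edges


class MotifGenerator:
    """Returns a motif for the chosen structure (names are assumptions)."""

    _STRUCTURES = {
        # Five-node house: square 0-1-2-3 plus roof node 4.
        "house": (5, [(0, 1), (1, 2), (2, 3), (3, 0), (0, 4), (1, 4)]),
        # Six-node cycle.
        "cycle": (6, [(i, (i + 1) % 6) for i in range(6)]),
    }

    def __init__(self, structure):
        self.num_nodes, self.edges = self._STRUCTURES[structure]


def build_explainer_graph(generator, motif, num_motifs, seed):
    """Attach `num_motifs` copies of `motif` to a generated base graph."""
    rng = random.Random(seed)
    num_base_nodes, base_edges = generator.generate(rng)
    edges = list(base_edges)
    node_label = [0] * num_base_nodes  # 0 = base graph, 1 = motif
    edge_mask = [0] * len(edges)       # 1 = ground-truth motif edge
    for anchor in rng.sample(range(num_base_nodes), num_motifs):
        offset = len(node_label)
        for u, v in motif.edges:
            edges.append((u + offset, v + offset))
            edge_mask.append(1)
        node_label.extend([1] * motif.num_nodes)
        # One bridging edge connects the motif to the base graph.
        edges.append((anchor, offset + rng.randrange(motif.num_nodes)))
        edge_mask.append(0)
    return edges, node_label, edge_mask


edges, node_label, edge_mask = build_explainer_graph(
    BAGraph(num_nodes=20, num_edges=2), MotifGenerator("house"),
    num_motifs=3, seed=0)
```

In this sketch, `edge_mask` marks the planted motif edges and can serve as the ground-truth explanation. The bridging edges are left out of the mask, which is a design choice rather than a requirement.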
🚀 The feature, motivation and pitch
Provide support for synthetic datasets commonly used in explainability papers.
This is part of the explainability roadmap #5520.
Create a high-level API for the following functionalities (each individual sub-task will be specified in a separate issue).
Synthetic datasets are often useful for explainability. While not always the most accurate benchmark for GNN explainability, they can be used to validate explainability algorithms, to debug models, and to provide ground truth for certain evaluations, such as identifying important subgraph structures.
Following GNNExplainer, the dataset construction has 2 parts:
List of tasks:
PyG could provide a general routine and framework for constructing these datasets. For base graphs, aside from those mentioned, it is also worth considering GraphWorld (KDD 2022), which covers a wide range of structural characteristics and is useful for real-world GNN research. For motifs, we can consider the motif atlas (for all size 4, 5, 6, 7 motifs ...). A user can construct a custom dataset by picking a random seed, a base graph generator, and a motif, and then measure explainability performance by how well an explainer identifies the selected motif as the important subgraph.
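One possible way to score "the ability to identify the selected motif" is to rank edges by the explainer's importance score and check how many of the top-ranked edges belong to the planted motif. The sketch below assumes the explainer returns one score per edge and that a binary ground-truth mask is available; the function name `motif_recovery` and the choice of metric are illustrative assumptions, not part of this proposal.

```python
def motif_recovery(edge_scores, edge_mask):
    """Fraction of the k highest-scoring edges that are ground-truth motif
    edges, where k is the number of ground-truth motif edges."""
    k = sum(edge_mask)
    if k == 0:
        raise ValueError("no ground-truth motif edges")
    ranked = sorted(range(len(edge_scores)),
                    key=lambda i: edge_scores[i], reverse=True)
    hits = sum(edge_mask[i] for i in ranked[:k])
    return hits / k


# Toy example: six edges, the first three belong to the planted motif.
edge_mask = [1, 1, 1, 0, 0, 0]
good_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.0]  # ranks all motif edges first
poor_scores = [0.1, 0.9, 0.2, 0.8, 0.7, 0.0]  # ranks only one motif edge in the top 3

perfect = motif_recovery(good_scores, edge_mask)
partial = motif_recovery(poor_scores, edge_mask)
```

Other metrics (for example AUC over edge scores) would fit the same inputs. Top-k recovery is used here only because it is easy to read.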
To ensure reproducibility, the dataset generator class will have an option to set a standard seed, a number of standard base graph generators, and a number of motif patterns. For research reproducibility, one can then simply use the deterministic setting to obtain a standard set of benchmark datasets for explainability.
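The deterministic setting could be sketched as follows, assuming all randomness is drawn from a private, seeded random number generator rather than from global state. The preset names and values in `PRESETS` and the helper `sample_attachment_points` are hypothetical and only illustrate the idea.

```python
import random

# Hypothetical presets; names and values are illustrative only.
PRESETS = {
    "small": {"num_nodes": 30, "num_motifs": 4, "seed": 0},
    "large": {"num_nodes": 300, "num_motifs": 80, "seed": 0},
}


def sample_attachment_points(preset, seed=None):
    """Pick the base-graph nodes that motifs will be attached to."""
    cfg = PRESETS[preset]
    # A private RNG keeps the result independent of the global random state.
    rng = random.Random(cfg["seed"] if seed is None else seed)
    return rng.sample(range(cfg["num_nodes"]), cfg["num_motifs"])


first = sample_attachment_points("small")
random.seed(12345)  # disturbing the global RNG does not affect the result
second = sample_attachment_points("small")
custom = sample_attachment_points("small", seed=7)  # user-chosen seed
```

With this design, the standard seed yields the same benchmark on every run, while passing an explicit seed gives a custom but still repeatable variant.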
Alternatives
No response
Additional context
No response