🚀 The feature, motivation and pitch
Add an abstraction for offline sampling support with data prefetching.
Sampling can easily become the bottleneck in graph learning pipelines, so some users may prefer to perform the sampling offline and store the results. Assuming that node/edge features are stored in a separate resource, the user can then fetch a pre-sampled subgraph, prefetch the associated features, and run training in PyG.
Alternatives
Based on Slack discussions: Since PyG supports a feature store, a graph store, and a sampler that operates on the graph store, PyG can already handle this use case: the GraphStore acts as the DB with pre-sampled subgraphs, the FeatureStore holds the features, and the sampler returns pre-sampled instances. The Loader is then defined by those three components.
The solution above works. However, unless I am missing something, I have a concern: it sounds more like a workaround, and the terminology feels ambiguous. For instance, when the GraphStore points to a DB of sampled subgraphs, the store no longer necessarily holds the materialized graph, and the sampler class no longer does any sampling; instead of a list of nodes to sample, it would receive a subgraph id, etc.
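To make the ambiguity concrete, here is a minimal, framework-agnostic sketch of the composition described above. All class and method names (`SubgraphStore`, `PresampledSampler`, `get_many`, etc.) are hypothetical stand-ins, not PyG's actual `FeatureStore`/`GraphStore`/sampler interfaces; the point is only the shape of the three roles and where they bend:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SubgraphStore:
    """'GraphStore' role: a DB of pre-sampled subgraphs keyed by id
    (note: it never holds the full materialized graph)."""
    subgraphs: Dict[int, List[Tuple[int, int]]] = field(default_factory=dict)

    def get(self, subgraph_id: int) -> List[Tuple[int, int]]:
        return self.subgraphs[subgraph_id]

@dataclass
class FeatureStore:
    """FeatureStore role: node id -> feature vector."""
    feats: Dict[int, List[float]] = field(default_factory=dict)

    def get_many(self, node_ids):
        return {n: self.feats[n] for n in node_ids}

class PresampledSampler:
    """'Sampler' role: does no sampling at all; it looks up a stored
    subgraph by id instead of receiving a list of seed nodes."""
    def __init__(self, store: SubgraphStore):
        self.store = store

    def sample(self, subgraph_id: int):
        return self.store.get(subgraph_id)

def loader(sampler, feature_store, subgraph_ids):
    """Loader role: stitch subgraph structure and prefetched features."""
    for sid in subgraph_ids:
        edges = sampler.sample(sid)
        nodes = {n for edge in edges for n in edge}
        yield edges, feature_store.get_many(nodes)

# Tiny usage example: one stored subgraph with three featured nodes.
gs = SubgraphStore({0: [(0, 1), (1, 2)]})
fs = FeatureStore({0: [1.0], 1: [2.0], 2: [3.0]})
batches = list(loader(PresampledSampler(gs), fs, [0]))
```

The sketch makes the concern visible: `PresampledSampler.sample` takes a subgraph id rather than seed nodes, which is exactly where the existing sampler terminology stops fitting.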
Additional context
It seems DGL has the same request filed as a feature request: https://github.com/dmlc/dgl/issues/4445

Thanks for creating this issue. For in-memory classes, I think this is trivially solvable by creating a NeighborLoader, fetching the mini-batches, and storing them on disk.
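The in-memory workaround suggested above (pre-compute mini-batches once, persist them, replay them during training) can be sketched as follows. `fake_sample` and the dict-shaped batches are stand-ins for a real `NeighborLoader` and the `Data` objects it yields; in actual PyG code one would serialize those with `torch.save`/`torch.load` instead of pickle:

```python
import pickle
import tempfile
from pathlib import Path

def fake_sample(seed_node: int) -> dict:
    # Hypothetical stand-in for one NeighborLoader mini-batch.
    return {"seed": seed_node, "n_id": [seed_node, seed_node + 1]}

def dump_batches(out_dir: Path, seeds) -> list:
    """Offline phase: sample once and persist each mini-batch to disk."""
    paths = []
    for i, seed in enumerate(seeds):
        path = out_dir / f"batch_{i}.pkl"
        with path.open("wb") as f:
            pickle.dump(fake_sample(seed), f)
        paths.append(path)
    return paths

def replay(paths):
    """Training phase: stream the stored mini-batches back from disk,
    reusable epoch after epoch with no sampling cost."""
    for path in paths:
        with path.open("rb") as f:
            yield pickle.load(f)

out_dir = Path(tempfile.mkdtemp())
paths = dump_batches(out_dir, seeds=[0, 10])
batches = list(replay(paths))
```

This covers the in-memory case, but it does not address the original request: the replayed batches bypass any separate feature store, so feature prefetching from a different resource still needs its own abstraction.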