pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Offline sampling support with data prefetching. #6010

Open ayasar70 opened 1 year ago

ayasar70 commented 1 year ago

🚀 The feature, motivation and pitch

Adding an abstraction for offline sampling support with data prefetching.

Sampling can easily become the bottleneck in graph learning pipelines. Some users might therefore prefer to perform sampling offline and store the results. Assuming that node/edge features live in a separate resource, the user can then fetch a pre-sampled subgraph, prefetch the associated features, and run training in PyG.
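For concreteness, here is a minimal sketch of such a pipeline, overlapping the feature fetch for the next subgraph with training on the current one. Everything here (`fetch_features`, `subgraphs`, `node_ids`, `train_step`) is a placeholder for illustration, not existing PyG API:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_features(node_ids):
    """Hypothetical lookup of node features from a remote store."""
    ...

# `subgraphs` is a hypothetical list of pre-sampled subgraphs, each
# exposing the global IDs of its nodes via `.node_ids`:
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fetch_features, subgraphs[0].node_ids)
    for i, sub in enumerate(subgraphs):
        x = future.result()  # features for the current subgraph
        if i + 1 < len(subgraphs):
            # Kick off the fetch for the next subgraph before training:
            future = pool.submit(fetch_features, subgraphs[i + 1].node_ids)
        train_step(sub, x)  # hypothetical training step
```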

Alternatives

Based on Slack discussions: since PyG supports a feature store, a graph store, and a sampler that operates on the graph store, PyG can already handle this use case. One can treat the GraphStore as the DB holding the pre-sampled subgraphs, let the FeatureStore hold the features, and have the sampler return the pre-sampled instances. The Loader is then defined by these three components.
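A rough sketch of what such a sampler could look like, assuming the `BaseSampler`/`SamplerOutput` interfaces from `torch_geometric.sampler`; the `subgraph_db` object and its `lookup` method are hypothetical stand-ins for the DB of pre-sampled subgraphs:

```python
from torch_geometric.sampler import BaseSampler, NodeSamplerInput, SamplerOutput

class PreSampledSampler(BaseSampler):
    """Replays subgraphs from an offline store instead of sampling online."""
    def __init__(self, subgraph_db):
        self.subgraph_db = subgraph_db  # hypothetical DB of pre-sampled subgraphs

    def sample_from_nodes(self, inputs: NodeSamplerInput) -> SamplerOutput:
        # Look up the stored subgraph for the given seed nodes rather than
        # performing any actual neighbor sampling:
        sub = self.subgraph_db.lookup(inputs.node)  # hypothetical call
        return SamplerOutput(node=sub.node, row=sub.row, col=sub.col,
                             edge=sub.edge)
```

A loader could then be built from the `(FeatureStore, GraphStore)` pair together with this sampler.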

The solution above works. However, unless I am missing something, I have one concern: it sounds more like a workaround, and the terminology feels ambiguous. For instance, when the GraphStore points to a DB of sampled subgraphs, the store no longer necessarily holds the materialized graph, and the sampler class no longer does any sampling; instead of a list of nodes to sample, it would receive a subgraph ID, etc.

Additional context

It seems that DGL has the same feature request open: https://github.com/dmlc/dgl/issues/4445

rusty1s commented 1 year ago

Thanks for creating this issue. For in-memory classes, I think this is trivially solvable by creating a NeighborLoader, fetching the mini-batches, and storing them on disk.
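For example (a sketch of that in-memory approach; the `data` object, the `batches/` directory, and the neighbor/batch sizes are assumptions):

```python
import glob

import torch
from torch_geometric.loader import NeighborLoader

# Sample the mini-batches once and persist them to disk:
loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=128,
                        input_nodes=data.train_mask, shuffle=False)
for i, batch in enumerate(loader):
    torch.save(batch, f'batches/batch_{i:05d}.pt')

# Later runs can skip sampling entirely and stream the stored batches:
for path in sorted(glob.glob('batches/batch_*.pt')):
    batch = torch.load(path)
    ...  # forward/backward pass on `batch`
```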