pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Smart mini-batch for node level tasks #784

Open rugezhao opened 4 years ago

rugezhao commented 4 years ago

🚀 Feature

I would appreciate a feature that implements smart mini-batching algorithms like Cluster-GCN (https://arxiv.org/pdf/1905.07953.pdf) for node-level tasks.

Motivation

As far as I can see, `batch_size` and the `batch` tensor currently only concatenate the graphs in a mini-batch and record which graph each node belongs to. This works very well for graph-level tasks, but not for a single huge graph where we want to form mini-batches by sampling nodes and edges.

rusty1s commented 4 years ago

We have integrated such a feature with the NeighborSampler. You can see the reddit.py script for an example.
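For reference, here is a minimal sketch of the sampler-based training loop in the spirit of reddit.py. Note that the NeighborSampler interface has changed across releases; this sketch uses the `sizes`-based form from later versions (the snippets further down in this thread use the older `size`/`num_hops` form), and the dataset path is just a placeholder:

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Reddit
from torch_geometric.data import NeighborSampler
from torch_geometric.nn import SAGEConv

dataset = Reddit('/tmp/Reddit')  # placeholder path
data = dataset[0]

# Seed mini-batches only at the training nodes of the single large graph,
# sampling 25 neighbors in the first hop and 10 in the second.
train_loader = NeighborSampler(data.edge_index, node_idx=data.train_mask,
                               sizes=[25, 10], batch_size=1024, shuffle=True)

class SAGE(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.convs = torch.nn.ModuleList([
            SAGEConv(in_channels, hidden_channels),
            SAGEConv(hidden_channels, out_channels),
        ])

    def forward(self, x, adjs):
        # `adjs` holds one (edge_index, e_id, size) triple per sampled hop.
        for i, (edge_index, _, size) in enumerate(adjs):
            x_target = x[:size[1]]  # target nodes are listed first
            x = self.convs[i]((x, x_target), edge_index)
            if i != len(self.convs) - 1:
                x = F.relu(x)
        return F.log_softmax(x, dim=-1)

model = SAGE(dataset.num_features, 256, dataset.num_classes)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

model.train()
for batch_size, n_id, adjs in train_loader:
    # `n_id` holds the original indices of all nodes sampled into this
    # mini-batch; the first `batch_size` entries are the target nodes.
    optimizer.zero_grad()
    out = model(data.x[n_id], adjs)
    loss = F.nll_loss(out, data.y[n_id[:batch_size]])
    loss.backward()
    optimizer.step()
```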

raphaelsulzer commented 4 years ago

Is there a way to use the NeighborSampler for sampling small graphs from multiple large graphs? For now, my workaround is to concatenate a list of large graphs with the DataLoader, grab the first batch from the DataLoader, and feed it into a NeighborSampler. However, what I can no longer do is apply a train/test mask to the subgraphs produced by the NeighborSampler. Is there a better way to achieve what I want?

rusty1s commented 4 years ago

This is interesting, and your solution should already work quite well. You can still apply train/test masks in the same way you filter your node features.
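For illustration, a minimal sketch of that workaround, assuming each graph carries a per-node `train_mask` and using the `sizes`-based NeighborSampler interface (argument names differ between versions); the random graphs are only placeholders:

```python
import torch
from torch_geometric.data import Data, Batch, NeighborSampler

def random_graph(num_nodes, num_edges, num_features=16):
    # Toy stand-in for one "large" graph with a per-node train mask.
    return Data(x=torch.randn(num_nodes, num_features),
                edge_index=torch.randint(num_nodes, (2, num_edges)),
                train_mask=torch.rand(num_nodes) < 0.8)

# Concatenate several large graphs into one big disconnected graph.
big = Batch.from_data_list([random_graph(1000, 5000), random_graph(800, 4000)])

# Seed the sampler at the training nodes of the concatenated graph.
loader = NeighborSampler(big.edge_index, node_idx=big.train_mask,
                         num_nodes=big.num_nodes,
                         sizes=[10, 5], batch_size=256, shuffle=True)

for batch_size, n_id, adjs in loader:
    x_sub = big.x[n_id]               # filter node features by sampled ids
    mask_sub = big.train_mask[n_id]   # the mask filters in exactly the same way
```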

raphaelsulzer commented 4 years ago

Indeed, the train/test masks still work the same way, of course. My bad. Thank you!

raphaelsulzer commented 4 years ago

> Is there a way to use the NeighborSampler for sampling small graphs from multiple large graphs? For now, my workaround is to concatenate a list of large graphs with the DataLoader, grab the first batch from the DataLoader, and feed it into a NeighborSampler. However, what I can no longer do is apply a train/test mask to the subgraphs produced by the NeighborSampler. Is there a better way to achieve what I want?

I am still not sure about some of the behaviour of my pipeline. How do I do a correct normalization in this scenario (separately for the train and test splits)? If I understand the NeighborSampler correctly, it will also sample nodes `n_id` that are within `num_hops` distance of my initial seed nodes `b_id` but may not be in the `train_mask`, correct?

In other words:

```python
data.train_mask = some_mask_defined_over_all_nodes_of_all_graphs

loader = NeighborSampler(data, size=5, num_hops=2, batch_size=10,
                         shuffle=True, add_self_loops=True)
batch_1 = loader.__get_batches__(data.train_mask)[0]
sub_graph_1 = loader.__produce_subgraph__(batch_1)
```

`sub_graph_1` now includes nodes where `data.train_mask == False`, and thus I did not normalize those node features.
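One possible way to handle this would be to compute the normalization statistics on the training nodes only, but apply the resulting transform to all node features up front. That way, any node the sampler pulls in from outside the `train_mask` is still normalized consistently with the training statistics, and no statistics from the test nodes leak into training. A minimal sketch, assuming standard mean/std feature normalization:

```python
# Mean/std computed from training nodes only, applied to every node, so
# neighbors sampled from outside the train_mask are normalized with the
# same (training-only) statistics.
mean = data.x[data.train_mask].mean(dim=0)
std = data.x[data.train_mask].std(dim=0).clamp(min=1e-12)
data.x = (data.x - mean) / std
```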