Open rugezhao opened 4 years ago
We have integrated such a feature with the `NeighborSampler`. You can see the `reddit.py` script for an example.
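For intuition, here is a toy sketch of what hop-wise neighbor sampling does, written in plain Python. This is an illustration only, not the PyG implementation; the function name and the sampling-without-replacement choice are assumptions made for the example.

```python
import random

def sample_neighbors(adj, seeds, size, num_hops, seed=0):
    """Toy k-hop neighbor sampling (illustrative, not the PyG code):
    starting from `seeds`, sample up to `size` neighbors per node for
    `num_hops` hops and return the set of visited node ids."""
    rng = random.Random(seed)
    visited = set(seeds)
    frontier = list(seeds)
    for _ in range(num_hops):
        next_frontier = []
        for node in frontier:
            neighbors = adj.get(node, [])
            # sample at most `size` neighbors of this node (no replacement)
            for n in rng.sample(neighbors, min(size, len(neighbors))):
                if n not in visited:
                    visited.add(n)
                    next_frontier.append(n)
        frontier = next_frontier
    return visited

# tiny graph: 0-1, 0-2, 1-3, 2-3, 3-4
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
sub_nodes = sample_neighbors(adj, seeds=[0], size=2, num_hops=2)
```

Starting from node 0 with two hops, the sampled sub-graph covers nodes reachable within those hops; node 4 is three hops away and is never visited.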
Is there a way to use the NeighborSampler for sampling small graphs from multiple large graphs? For now my workaround is to concatenate a list of large graphs with the DataLoader and then simply get a first batch from the DataLoader and feed it into a NeighborSampler. However, what I cannot do anymore now is to apply a train/test mask on the subgraphs from the NeighborSampler. Is there any better way to achieve what I want to do?
This is interesting, and your solution should already work quite well. You can still use the train/test masks the same way you filter your node features.
Indeed the train/test masks still work the same of course. My bad. Thank you!
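Filtering the masks the same way as the features can be sketched like this, assuming the sampler exposes the global ids `n_id` of the nodes in the sampled sub-graph (plain Python for illustration):

```python
# Masks and features are both defined over all nodes, so the same
# global-id indexing works for both. `n_id` is assumed to hold the
# global ids of the sampled sub-graph's nodes.
train_mask = [True, True, False, False, True]   # over all nodes
features   = [[0.1], [0.2], [0.3], [0.4], [0.5]]

n_id = [4, 1, 2]  # global ids of the nodes in one sampled sub-graph

sub_x    = [features[i] for i in n_id]    # sub-graph features
sub_mask = [train_mask[i] for i in n_id]  # sub-graph train mask
```

With tensors the same idea is just `data.train_mask[n_id]`, exactly like `data.x[n_id]`.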
I am still not sure about some of the behaviour of my pipeline. How do I normalize correctly in this scenario (separately for the train and test split)? If I understand the `NeighborSampler` correctly, it will also sample nodes `n_id` that are within `num_hops` distance of my initial starting nodes `b_id` but may not be in the `train_mask`, correct?
In other words:

```python
data.train_mask = some_mask_defined_over_all_nodes_of_all_graphs
loader = NeighborSampler(data, size=5, num_hops=2, batch_size=10,
                         shuffle=True, add_self_loops=True)
batch_1 = loader.get_batches__(data.train_mask)[0]
sub_graph_1 = loader.produce_subgraph__(batch_1)
```
`sub_graph_1` now includes nodes where `data.train_mask` is `False`, whose features I therefore did not normalize.
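One common way to handle this is to fit the normalization statistics on the training nodes only, then apply those same statistics to every node, so any test-split node the sampler pulls into a sub-graph is normalized consistently. A minimal sketch in plain Python (the function names are illustrative, not a PyG API):

```python
def fit_standardizer(features, train_mask):
    """Compute per-dimension mean/std from training nodes only."""
    train_feats = [x for x, m in zip(features, train_mask) if m]
    dim = len(train_feats[0])
    mean = [sum(x[d] for x in train_feats) / len(train_feats)
            for d in range(dim)]
    var = [sum((x[d] - mean[d]) ** 2 for x in train_feats) / len(train_feats)
           for d in range(dim)]
    std = [v ** 0.5 or 1.0 for v in var]  # guard against zero variance
    return mean, std

def apply_standardizer(features, mean, std):
    """Apply the train-split statistics to ALL nodes, train or not."""
    return [[(x[d] - mean[d]) / std[d] for d in range(len(mean))]
            for x in features]

features = [[1.0], [3.0], [100.0]]
train_mask = [True, True, False]
mean, std = fit_standardizer(features, train_mask)
normed = apply_standardizer(features, mean, std)
```

Because the statistics come only from masked training nodes, no information leaks from the test split, yet every node the sampler can reach has normalized features.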
🚀 Feature
I would appreciate a feature that implements smart minibatching algorithms such as Cluster-GCN (https://arxiv.org/pdf/1905.07953.pdf) for node-level tasks.
Motivation
Currently, `batch_size` and the `batch` tensor only concatenate the graphs in a minibatch and record which graph each node belongs to. This works very well for graph-level tasks, but not for a single huge graph where mini-batches have to be built by sampling nodes and edges.
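The core idea in the linked paper is to partition the nodes and let each minibatch be the sub-graph induced by one partition, keeping only intra-batch edges. A toy sketch, assuming a naive round-robin partition in place of the METIS partitioner the paper actually uses:

```python
def cluster_batches(num_nodes, edges, num_parts):
    """Cluster-GCN-style minibatching sketch: partition the nodes,
    then each minibatch is the node set of one partition together
    with the edges that stay inside it (inter-partition edges are
    dropped). Round-robin partition is a stand-in for METIS."""
    part = [i % num_parts for i in range(num_nodes)]
    batches = []
    for p in range(num_parts):
        nodes = [i for i in range(num_nodes) if part[i] == p]
        node_set = set(nodes)
        sub_edges = [(u, v) for u, v in edges
                     if u in node_set and v in node_set]
        batches.append((nodes, sub_edges))
    return batches

# 5 nodes, edges 0-2, 2-4, 1-3, 0-1, split into 2 clusters
batches = cluster_batches(5, [(0, 2), (2, 4), (1, 3), (0, 1)], 2)
```

Each batch is then a small, self-contained graph that fits in memory, at the cost of losing the dropped inter-cluster edges (the paper mitigates this by sampling several clusters per batch).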