pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Neighborhood Sampling #64

Closed tbright17 closed 1 year ago

tbright17 commented 5 years ago

Hi Matthias,

I wrote my own dataset and dataloader, and I used an adjacency matrix instead of edge_index. When I tried to convert adj_matrix to edge_index, I got confused because I have multiple graphs (multiple samples, possibly with different numbers of nodes) in one batch. I went over some of the examples and found most of them have batch_size 1. How should I prepare edge_index in a mini-batch setting? I can easily use DenseSAGEConv, but I want to try other networks.

Thanks, Ming

rusty1s commented 5 years ago

Hi,

If I understand you correctly, you are saving and loading dense adjacency matrices and want to convert them to a sparse layout. In general, this approach is not recommended, because of the huge memory overhead when stacking dense adjacency matrices block-wise. If you want to convert your dense matrices to sparse matrices, you can make use of the nonzero method from PyTorch.

edge_index = adj.nonzero().t().contiguous()
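
For context, a minimal sketch of this conversion on a hypothetical 3-node graph (the adjacency matrix and feature dimension are made up for illustration):

import torch
from torch_geometric.data import Data

# Hypothetical dense adjacency matrix of a 3-node path graph:
adj = torch.tensor([[0, 1, 0],
                    [1, 0, 1],
                    [0, 1, 0]])

# nonzero() yields the (row, col) pairs of all edges; transpose to COO layout:
edge_index = adj.nonzero().t().contiguous()
# tensor([[0, 1, 1, 2],
#         [1, 0, 2, 1]])

data = Data(x=torch.randn(3, 8), edge_index=edge_index)
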
tbright17 commented 5 years ago

Thanks for the reply. I also want to do mini-batch training so that each batch has several graphs. Can I achieve that?

Thanks

rusty1s commented 5 years ago

This can be automatically achieved when using the torch_geometric.data.DataLoader.
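
For illustration, a minimal sketch (the two toy graphs are made up; in recent versions the loader lives in torch_geometric.loader):

import torch
from torch_geometric.data import Data, DataLoader

# Two toy graphs with different numbers of nodes:
data_list = [
    Data(x=torch.randn(3, 8), edge_index=torch.tensor([[0, 1], [1, 2]])),
    Data(x=torch.randn(5, 8), edge_index=torch.tensor([[0, 2, 3], [1, 3, 4]])),
]
loader = DataLoader(data_list, batch_size=2, shuffle=True)

for batch in loader:
    # Graphs are stacked into one block-diagonal graph;
    # `batch.batch` maps each node to its originating graph.
    print(batch.num_graphs, batch.batch)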

tbright17 commented 5 years ago

Thanks. I will check this.

zeneofa commented 5 years ago

Hi,

Thanks again for the awesome package. I have a few questions related to this, so I did not want to create a separate issue:

1. From the examples, it appears that mini-batching is done by graph only? Is it possible to do this by node or by edge? (I only have one large graph.)
2. For SplineConv, is the entire edge_index, edge_attr and node set provided for each forward pass?

Background: I have a large, dense graph with ~260,000 nodes and approximately 8 node features, from which I am trying to train a node classification model (binary classes) with the help of a 3D manifold (this forms my pseudo-coordinates). Edges are independent of the manifold itself, though their attributes are derived from it.

rusty1s commented 5 years ago

Hi, yes, this is a very important subject and something that is currently not supported. This would probably be done by a new dataloader which samples nodes from the graph and builds a k-hop subgraph around each node (where k corresponds to the number of layers). I will try to add this to the package. There are already a bunch of papers on this topic, so if you need something specific, please point me to the respective papers.
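
For reference, later versions ship a utility for exactly this building block; a minimal sketch using torch_geometric.utils.k_hop_subgraph (the toy edge_index is made up):

import torch
from torch_geometric.utils import k_hop_subgraph

edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 4]])
# Extract the 2-hop subgraph around node 2 and relabel its nodes:
subset, sub_edge_index, mapping, edge_mask = k_hop_subgraph(
    node_idx=2, num_hops=2, edge_index=edge_index, relabel_nodes=True)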

zeneofa commented 5 years ago

Hi,

I have looked into the L-hop strategy, but with the density of my network, this will effectively cover the entire network with only a few layers.

There is this paper (https://arxiv.org/pdf/1710.10568.pdf), with TensorFlow code (https://github.com/thu-ml/stochastic_gcn), that looks promising. The code is not documented, with only a few comments, which makes it quite hard to parse (that, and I have not used TensorFlow at all). The idea, however, seems quite straightforward. I am not sure how to integrate this with SplineConv, though.

duncster94 commented 5 years ago

Might be worth looking into this as well: https://arxiv.org/pdf/1801.10247.pdf

They report significant speedups over the Kipf & Welling GCN and GraphSAGE, with comparable performance.

familyld commented 5 years ago

Looking forward to mini-batching support for a single big graph.

fkhawar commented 5 years ago

Hi, any progress on this, or any workaround in the meantime?

rusty1s commented 5 years ago

Been working on it :)

rusty1s commented 5 years ago

I added a first version of NeighborSampler to PyTorch Geometric. It is still undocumented and unfinished (e.g., there is currently no support for num_workers and node probabilities). You can find a training example on the giant Reddit graph in the examples/ directory here. PyTorch Cluster needs to be upgraded to v1.4.0 in order to use it.

I would be very happy to discuss the API here and get feedback from you. Currently, you can iterate the loader and access a batch-wise DataFlow object, which defines a computation flow up to num_hops+1 layers. You can print it to see how many nodes are being accessed in each layer, e.g.:

DataFlow(1000<-20000<-60000)

Each block in DataFlow defines a bipartite graph between intermediate layers. Starting from the root nodes, you can propagate messages to the final nodes.
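
As a rough sketch of this usage (the size and num_hops values are made up, and this early DataFlow API has since been replaced):

loader = NeighborSampler(data, size=[25, 10], num_hops=2, batch_size=1000,
                         shuffle=True)
for data_flow in loader:
    print(data_flow)         # e.g. DataFlow(1000<-20000<-60000)
    block = data_flow[0]     # first bipartite block: root nodes and their samples
    print(block.edge_index)  # edges of this bipartite graph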

duncster94 commented 5 years ago

@rusty1s Thanks for the update. I'm having trouble getting torch-cluster v1.4.0 installed. v1.2.1 installation gives me no problems.

Here's the error:

cpu/graclus.cpp: In lambda function:
cpu/graclus.cpp:52:43: error: invalid initialization of reference of type ‘const at::Type&’ from expression of type ‘c10::ScalarType’
   AT_DISPATCH_ALL_TYPES(weight.scalar_type(), "weighted_graclus", [&] {
                                               ^
/scratch/gobi1/forsterd/pytorchEnv/lib/python3.7/site-packages/torch/lib/include/ATen/Dispatch.h:70:32: note: in definition of macro ‘AT_DISPATCH_ALL_TYPES’
     const at::Type& the_type = TYPE;                                         \
                                ^
error: command 'gcc' failed with exit status 1

Any ideas?

rusty1s commented 5 years ago

Yes, you need to update to PyTorch 1.1. Sorry :(

duncster94 commented 5 years ago

@rusty1s That solved it, thanks!

mainak124 commented 5 years ago

@rusty1s Thanks very much for adding the NeighborSampler. If I understood correctly, it supports a node iterator, which is helpful for supervised tasks. For unsupervised tasks, however, I guess having an edge iterator would be necessary. For now, I am just trying to reproduce the results (unsupervised setting) from the GraphSAGE paper. Is there any plan of adding that support in the future? Thanks a lot!

rusty1s commented 5 years ago

Can you elaborate a bit more? I am not sure if I fully understand. For GraphSAGE unsupervised learning, you need negative sampling and a sampling scheme for near neighbors. Negative sampling is trivial via the use of another, randomly shuffling NeighborSampler. For sampling of neighbors, you need to sample node indices from a random walk window and input those to the NeighborSampler to produce a corresponding DataFlow object, e.g.:

sampler = NeighborSampler(..., shuffle=True)
negative_sampler = NeighborSampler(..., shuffle=True)

for data_flow, negative_data_flow in zip(sampler, negative_sampler):
    n_id = data_flow.n_id  # Get node indices
    rw_sampled_n_id = ...  # Sample node indices from random walks starting from `n_id`   
    # Get another `data_flow` object to the sampled node indices 
    neighboring_data_flow = sampler.__produce__(rw_sampled_n_id)

    # Compute embeddings for each `data_flow` object
    z = model(data_flow)
    negative_z = model(negative_data_flow)
    neighboring_z = model(neighboring_data_flow)
    ...

    # Compute loss
    ...

For random walk sampling, we provide a GPU-only and still undocumented functionality in torch-cluster. WDYT?
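
For reference, that random-walk step could look roughly like this with torch_cluster's random_walk (row and col are assumed to hold the full graph's COO connectivity, and n_id the batch node indices; names are illustrative):

from torch_cluster import random_walk

# Sample one random-walk step per batch node; the result has shape
# [num_start_nodes, walk_length + 1], with the start nodes in column 0.
walks = random_walk(row, col, n_id, walk_length=1)
rw_sampled_n_id = walks[:, 1]  # the sampled near neighbors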

mainak124 commented 5 years ago

@rusty1s I was thinking of iterating through all the edges in the graph instead of iterating through all the nodes in one epoch. But your approach is much cleaner! Thank you so much! :)

duncster94 commented 5 years ago

Hey @rusty1s, quick question: does data flow computation necessarily happen on the CPU? I.e., is it always necessary to send the data flow object to the device you're using after it's computed on the CPU?

rusty1s commented 5 years ago

Yes. I am not sure if GPUs could bring any speed-ups here; in the end, it would require holding the whole graph on the GPU, something we want to prevent with this approach.

JensTheDude commented 5 years ago

@rusty1s Thank you for the graphsage implementation. Are there any plans to implement a sampler for FastGCN as well?

rusty1s commented 5 years ago

Yes :)

ankitjain451 commented 5 years ago

In unsupervised GraphSAGE, the original code iterates over edges instead of nodes and computes the reconstruction loss on sampled edges. Do you have plans to support that? I am not sure how iterating over nodes will help in the unsupervised case.

rusty1s commented 5 years ago

I am not sure I fully understand why you can not implement the unsupervised GraphSAGE loss with the current neighbor sampling API, see https://github.com/rusty1s/pytorch_geometric/issues/64#issuecomment-505734268. Can you elaborate?

ankitjain451 commented 5 years ago

I looked into it and yes you can. I have two points:

I am fairly new to PyTorch, so don't mind if these are simple questions.

rusty1s commented 5 years ago

> In your code above, it is not really clear to me how you would do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?

You certainly could use more sophisticated negative sampling strategies. A basic strategy to achieve this would be to just resample nodes which already occur in the sampled neighborhood. In practice, this is negligible IMO.

> Also, can you please guide me more on how to structure the loss and validation for unsupervised learning? I want to use a reconstruction loss (reconstructing edges based on embeddings). Currently, in TensorFlow, I am batching by edges, which helps me remove a few edges from the training dataset and use a max-margin loss. I am not sure if that provision is built into your code right now.

Batching by edges is also a good idea to implement the GraphSAGE loss. Given that you have sampled positive edges and sampled negative edges, you can compute the source and target node embeddings using the NeighborSampler:

pos_edge_index_batch = ...
neg_edge_index_batch = ...

sampler = NeighborSampler(..., shuffle=False)

for z_u_data_flow, z_v_data_flow, z_vn_data_flow in zip(
        sampler(pos_edge_index_batch[0]),
        sampler(pos_edge_index_batch[1]),
        sampler(neg_edge_index_batch[1])):
    z_u = model(z_u_data_flow)    # embeddings of positive source nodes
    z_v = model(z_v_data_flow)    # embeddings of positive target nodes
    z_vn = model(z_vn_data_flow)  # embeddings of negative target nodes
    # Compute loss based on Eq. (1) of https://arxiv.org/pdf/1706.02216.pdf

ankitjain451 commented 5 years ago

Thanks, I think that makes sense. I am just trying to get my head around your library so I can implement edge-based batching. Can you guide me on where and how to implement this while maintaining your code design? I can eventually submit a pull request with it.

rusty1s commented 5 years ago

I think a simple example on how to implement the GraphSAGE unsupervised loss should be sufficient. Feel free to submit a PR :)

wwliu555 commented 5 years ago

@rusty1s Thanks for the update! I ran into a Segmentation fault when running the example reddit.py:

Fatal Python error: Segmentation fault

Current thread 0x00007fe21326f740 (most recent call first):
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_cluster/sampler.py", line 13 in neighbor_sampler
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 154 in __produce__
  File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 199 in __call__
  File "reddit.py", line 68 in train
  File "reddit.py", line 89 in <module>
Segmentation fault

Any idea how I can fix it? Thanks in advance!

rusty1s commented 5 years ago

Can you run the torch-cluster test suite to verify that everything works as expected?

wwliu555 commented 5 years ago

The test output said that my compiler (g++ 4.8.5) was incompatible. I upgraded it and it works fine now. Thanks :)

wwliu555 commented 5 years ago

Hey @rusty1s, I'm trying to make NNConv work with NeighborSampler.

One quick question: how can I obtain edge_attr in each block? Should I add self-loops to the graph first and then use e_id to index them?

rusty1s commented 5 years ago

Exactly, self-loops need to exist in advance in order to use e_id. In addition, you need to set add_self_loops=True so that self-loops are guaranteed to be sampled.
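
A minimal sketch of that e_id-based lookup under the old DataFlow API (assuming self-loops and their attributes were already added to data.edge_index and data.edge_attr):

loader = NeighborSampler(data, size=[25, 10], num_hops=2, batch_size=1000,
                         add_self_loops=True)
for data_flow in loader:
    for block in data_flow:
        # e_id holds the global ids of the sampled edges in this block,
        # so it can index into the precomputed edge attributes:
        block_edge_attr = data.edge_attr[block.e_id]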

raphaelsulzer commented 4 years ago

Is there any smart way to do inference with the neighborhood sampler? I am using it for training on large graphs, but for inference I run out of memory very quickly.

I am thinking of something like sampling n (overlapping) regions from the graph and making sure that every node is sampled at least once. Is that currently possible?

Thanks!

duncster94 commented 4 years ago

@raphaelsulzer If you can fit all the neighbours of any single node in memory, you can try this:

ns = NeighborSampler(data,
    size=1.0,
    num_hops=num_hops,
    batch_size=1,
    shuffle=False,
    add_self_loops=True)

This will sample each node and its num_hops neighbourhood, so you can do a forward pass on each node one-by-one.
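
A rough sketch of the resulting inference loop (model(x, data_flow) follows the calling convention of the old examples; with shuffle=False, nodes are visited in id order):

preds = []
for data_flow in ns:
    # Forward pass on the full num_hops neighbourhood of a single node:
    preds.append(model(data.x, data_flow))
preds = torch.cat(preds, dim=0)  # one prediction per node, in node-id order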

raphaelsulzer commented 4 years ago

@duncster94 Thank you! That works!

codexhammer commented 2 years ago

Hi @rusty1s,

I am trying to train on and classify only a fraction of the total number of classes in a continual learning setting. For example, for the Cora dataset with 7 classes:

import os.path as osp
import torch
import torch_geometric.transforms as transforms
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Cora')
data = Planetoid(path, "Cora", transform=transforms.NormalizeFeatures())[0]
classes_in_task = [0, 1, 2]

conditions = torch.BoolTensor([l in classes_in_task for l in data.y.tolist()])
data.train_mask = data.train_mask & conditions  # mask out train nodes with classes not in [0, 1, 2]
data.val_mask = data.val_mask & conditions      # mask out validation nodes with classes not in [0, 1, 2]
data.test_mask = data.test_mask & conditions    # mask out test nodes with classes not in [0, 1, 2]

train_task_nid = torch.nonzero(data.train_mask).flatten()  # select node ids with classes in [0, 1, 2]

train_loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=2000,
    input_nodes=train_task_nid,
)

I use train_loader to train on nodes with only specific classes in multiple batches. But training the network this way on only specific nodes isn't giving me good results. Is there any workaround for this?

Also, during testing, I want to test all the test nodes as a single batch. How do I set this in the NeighborLoader arguments?

rusty1s commented 2 years ago

What do you mean by "it does not give good results"? Is it due to your learning setup or due to NeighborLoader?

During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.

codexhammer commented 2 years ago

> During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.

If I omit the usage of NeighborLoader, how can I select only the input nodes with classes [0,1,2] for testing? Is there a function for selecting only a particular set of nodes? Thanks for the reply.

rusty1s commented 2 years ago

After obtaining the model output, you apply masking to keep only the node embeddings you are interested in:

out = model(data.x, data.edge_index)
out_train = out[data.train_mask]
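
Applied to the class-restricted evaluation asked about above, a minimal sketch (classes_in_task as defined in the earlier snippet; the class restriction is recomputed from scratch, so it also works on unmasked data):

out = model(data.x, data.edge_index)  # full-batch forward pass
class_mask = (data.y.unsqueeze(1) == torch.tensor(classes_in_task)).any(dim=1)
out_test = out[data.test_mask & class_mask]  # test nodes with classes in [0, 1, 2]
y_test = data.y[data.test_mask & class_mask]
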
vincentvic commented 2 years ago

Hi! I'm trying to train an unsupervised GraphSAGE model on a heterogeneous graph, and I'm having some difficulties computing the loss when using NeighborLoader. Thank you for the help.

rusty1s commented 2 years ago

Can you share some more details (at best in a new issue)? I'm happy to help!

kayzliu commented 2 years ago

@rusty1s How can I know which nodes are the center nodes in the sampled data obtained by torch_geometric.loader.NeighborLoader?

rusty1s commented 2 years ago

The center nodes will be placed first in the sampled output. As such, they are given via slicing based on the batch size (the number of center nodes), that is batch.x[:batch_size]. We also apply this slicing in the linked example. Hope this clarifies your doubts!
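
For illustration, a minimal sketch of this slicing (NeighborLoader attaches batch_size to each sampled mini-batch; the model and loader arguments are placeholders):

from torch_geometric.loader import NeighborLoader

loader = NeighborLoader(data, num_neighbors=[10, 10], batch_size=128,
                        input_nodes=data.train_mask, shuffle=True)
for batch in loader:
    out = model(batch.x, batch.edge_index)
    out = out[:batch.batch_size]    # embeddings of the center (seed) nodes
    y = batch.y[:batch.batch_size]  # labels of the center nodes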

agosztolai commented 2 years ago

Hello,

thanks for the awesome code!

I am trying to implement the negative sampling algorithm used here but using the new NeighborLoader class.

Is there already an implementation of this?

My understanding is that I need to replace

train_loader = NeighborSampler(data.edge_index, sizes=[10, 10], batch_size=256, shuffle=True, num_nodes=data.num_nodes)

by

kwargs = {'batch_size': 256}
train_loader = NeighborLoader(data, num_neighbors=[10, 10], shuffle=True, **kwargs)

Then, when I run

batch = next(iter(train_loader))

the output batch has exactly the same shape as the input data, except that edge_index is smaller and there is now a batch_size attribute too. It appears that data.x has been permuted and data.edge_index has been subsampled.

Are the 256 sampled nodes given by batch.x[:batch_size] (as per your previous answer)? If so, what are the other nodes batch.x[batch_size:]? In particular, where are the positive and negative samples? Also, what is data.edge_index? Is it the induced subgraph of the center nodes, or does it also include the positive/negative samples?

rusty1s commented 2 years ago

There exists a LinkNeighborLoader class that also supports negative sampling via neg_sampling_ratio.
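
A minimal sketch of its use (the edge_label_index choice and the ratio value are illustrative):

from torch_geometric.loader import LinkNeighborLoader

loader = LinkNeighborLoader(
    data,
    num_neighbors=[10, 10],
    batch_size=256,
    edge_label_index=data.edge_index,  # positive (supervision) edges
    neg_sampling_ratio=1.0,            # sample one negative per positive edge
    shuffle=True,
)
for batch in loader:
    # batch.edge_label_index holds positive and sampled negative edges;
    # batch.edge_label is 1 for positives and 0 for sampled negatives.
    ...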

v01cano commented 1 year ago

Question: how should NeighborSampler(data, size=[1.0, 1.0], num_hops=2, batch_size=batch_size, shuffle=True, add_self_loops=True) be expressed when converted to the NeighborLoader API?