Closed: tbright17 closed this issue 1 year ago
Hi,
If I understand you correctly, you are saving and loading dense adjacency matrices and want to convert them to a sparse layout. In general, this approach is not recommended because of the huge memory overhead when stacking dense adjacency matrices block-wise. If you want to convert your dense matrices to sparse ones, you can make use of the nonzero method from PyTorch:
edge_index = adj.nonzero().t().contiguous()
Thanks for the reply. I also want to do mini-batch training so each batch has several graphs. Can I achieve that?
Thanks
This can be automatically achieved when using the torch_geometric.data.DataLoader.
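As a rough sketch (assuming you already have a list of dense adjacency matrices adjs, node feature matrices xs, and a model that accepts a batch vector; all of these names are assumptions, not from the original post), you could build one Data object per graph and let the loader merge several graphs into a single disconnected union per mini-batch:
from torch_geometric.data import Data, DataLoader

data_list = []
for adj, x in zip(adjs, xs):
    edge_index = adj.nonzero().t().contiguous()  # Dense adjacency -> sparse edge_index
    data_list.append(Data(x=x, edge_index=edge_index))

loader = DataLoader(data_list, batch_size=32, shuffle=True)
for batch in loader:
    # `batch.batch` assigns every node to its graph within the mini-batch.
    out = model(batch.x, batch.edge_index, batch.batch)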
Thanks. I will check this.
Hi,
Thanks again for the awesome package. I have a few questions related to this, so I did not want to create a separate issue:
1) From the examples it appears that mini-batching is done by graph only? Is it possible to do this by node or edge? (I only have one large graph.)
2) For SplineConv, is the entire edge_index, edge_attr and node set provided for each forward pass?
Background: I have a large, dense graph with ~260,000 nodes and approximately 8 node features. I am trying to train a node classification model (binary classes) from it, with the help of a 3D manifold (this forms my pseudo-coordinates). Edges are independent of the manifold itself, though their attributes are derived from it.
Hi, Yes, this is a very important subject and something that is currently not supported. This would probably be done by a new dataloader which samples nodes from the graph and builds a k-hop subgraph around each node (where k corresponds to the number of layers). I will try to add this to the package. There are already a bunch of papers on this topic, so if you need something specific, please point me to the respective papers.
Hi,
I have looked into the L-hop strategy, but with the density of my networks, this will effectively be the entire network with only a few layers.
There is this paper (https://arxiv.org/pdf/1710.10568.pdf), with TensorFlow code (https://github.com/thu-ml/stochastic_gcn), that looks promising. The code is not documented, with only a few comments, which makes it quite hard to parse (that, and I have not used TensorFlow at all). The idea, however, seems quite straightforward; I am just not sure how to integrate it with SplineConv.
Might be worth looking into this as well: https://arxiv.org/pdf/1801.10247.pdf
They report significant speedups over the Kipf & Welling GCN and GraphSAGE with comparable performance.
Looking forward to seeing mini-batching support for a single big graph.
Hi, any progress on this or any work around in the meanwhile?
Been working on it :)
I added a first version of NeighborSampler to PyTorch Geometric. It is still undocumented and unfinished (e.g., there is currently no support for num_workers and node probabilities). You can find a training example on the giant Reddit graph in the examples/ directory here. PyTorch Cluster needs to be upgraded to v1.4.0 in order to use it.
I would be very happy to discuss the API here and get feedback from you. Currently, you can iterate the loader and access a batch-wise DataFlow object, which defines a computation flow up to num_hops+1 layers. You can print it to see how many nodes are being accessed in each layer, e.g.:
DataFlow(1000<-20000<-60000)
Each block in DataFlow defines a bipartite graph between intermediate layers. Starting from the root nodes, you can propagate messages to the final nodes.
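A rough usage sketch along the lines of the Reddit example (the argument and attribute names here are assumptions and may differ between versions):
loader = NeighborSampler(data, size=[25, 10], num_hops=2, batch_size=1000, shuffle=True)

for data_flow in loader(data.train_mask):
    print(data_flow)  # e.g. DataFlow(1000<-20000<-60000)
    for block in data_flow:
        # Each block is a bipartite graph between two consecutive layers.
        print(block.edge_index.size())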
@rusty1s Thanks for the update. I'm having trouble getting torch-cluster v1.4.0 installed. v1.2.1 installation gives me no problems.
Here's the error:
cpu/graclus.cpp: In lambda function:
cpu/graclus.cpp:52:43: error: invalid initialization of reference of type ‘const at::Type&’
from expression of type ‘c10::ScalarType’
AT_DISPATCH_ALL_TYPES(weight.scalar_type(), "weighted_graclus", [&] {
^
/scratch/gobi1/forsterd/pytorchEnv/lib/python3.7/site-packages/torch/lib/include/ATen/Dispatch.h:70:32: note: in definition of macro ‘AT_DISPATCH_ALL_TYPES’
const at::Type& the_type = TYPE; \
^
error: command 'gcc' failed with exit status 1
Any ideas?
Yes, you need to update to PyTorch 1.1. Sorry :(
@rusty1s That solved it, thanks!
@rusty1s Thanks very much for adding the NeighborSampler. If I understood correctly, it supports a node iterator, which is helpful for supervised tasks. For unsupervised tasks, however, I guess having an edge iterator would be necessary. For now, I am just trying to reproduce the result (unsupervised setting) from the GraphSAGE paper. Is there any plan to add that support in the future? Thanks a lot!
Can you elaborate a bit more? I am not sure if I fully understand. For GraphSAGE unsupervised learning, you need negative sampling and a sampling scheme for near neighbors. Negative sampling is trivial via the use of another, randomly shuffling NeighborSampler. For sampling of neighbors, you need to sample node indices from a random walk window and input those to the NeighborSampler to produce a corresponding DataFlow object, e.g.:
sampler = NeighborSampler(..., shuffle=True)
negative_sampler = NeighborSampler(..., shuffle=True)

for data_flow, negative_data_flow in zip(sampler, negative_sampler):
    n_id = data_flow.n_id  # Get node indices
    rw_sampled_n_id = ...  # Sample node indices from random walks starting from `n_id`

    # Get another `data_flow` object for the sampled node indices
    neighboring_data_flow = sampler.__produce__(rw_sampled_n_id)

    # Compute embeddings for each `data_flow` object
    z = model(data_flow)
    negative_z = model(negative_data_flow)
    neighboring_z = model(neighboring_data_flow)
    ...

    # Compute loss
    ...
For random walk sampling, we provide a GPU-only and still undocumented functionality in torch-cluster. WDYT?
@rusty1s I was thinking of iterating through all the edges in the graph instead of iterating through all the nodes in one epoch. But your approach is much cleaner! Thank you so much! :)
Hey @rusty1s quick question: does data flow computation necessarily happen on the CPU? I.e. is it always necessary to send the data flow object to the device you're using after it's computed on the CPU?
Yes. I am not sure if GPUs could bring any speed-ups here. In the end, it would require the whole graph on the GPU, something we want to prevent with this approach.
@rusty1s Thank you for the graphsage implementation. Are there any plans to implement a sampler for FastGCN as well?
Yes :)
In unsupervised GraphSAGE, the original code iterates over edges instead of nodes and computes the reconstruction loss on sampled edges. Do you have plans to support that? I am not sure how iterating over nodes will help in the unsupervised case.
I am not sure I fully understand why you can not implement the unsupervised GraphSAGE loss with the current neighbor sampling API, see https://github.com/rusty1s/pytorch_geometric/issues/64#issuecomment-505734268. Can you elaborate?
I looked into it and yes you can. I have two points:
In your code above, it is not really clear to me how would you do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?
Also, can you please guide me more on how to structure the loss and validation for unsupervised learning. I want to use reconstruction loss (reconstructing edges based on embeddings). Currently in Tensorflow, I am batching by edges which helps me remove few edges from the training dataset and use max margin loss. Not sure if that provision is built in your code right now?
I am fairly new to PyTorch, so don't mind if these are simple questions.
In your code above, it is not really clear to me how would you do negative sampling. Negative samples for each batch are nodes which are not connected to existing nodes in the batch. How do you ensure that?
You certainly could use more sophisticated negative sampling strategies. A basic strategy would be to simply resample whenever a drawn negative node already occurs in the sampled neighborhood. In practice, this is negligible IMO.
Also, can you please guide me more on how to structure the loss and validation for unsupervised learning. I want to use reconstruction loss (reconstructing edges based on embeddings). Currently in Tensorflow, I am batching by edges which helps me remove few edges from the training dataset and use max margin loss. Not sure if that provision is built in your code right now?
Batching by edges is also a good idea to implement the GraphSAGE loss. Given that you have sampled positive edges and sampled negative edges, you can compute the source and target node embeddings using the NeighborSampler:
pos_edge_index_batch = ...
neg_edge_index_batch = ...
sampler = NeighborSampler(..., shuffle=False)

for z_u_data_flow, z_v_data_flow, z_vn_data_flow in zip(sampler(pos_edge_index_batch[0]),
                                                         sampler(pos_edge_index_batch[1]),
                                                         sampler(neg_edge_index_batch[1])):
    z_u = model(z_u_data_flow)    # Embeddings of source nodes
    z_v = model(z_v_data_flow)    # Embeddings of positive target nodes
    z_vn = model(z_vn_data_flow)  # Embeddings of negative target nodes
    # Compute loss based on Eq. (1) of https://arxiv.org/pdf/1706.02216.pdf
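The loss itself could then look roughly like the following (a sketch of Eq. (1), assuming z_u, z_v and z_vn from the loop above and a single negative sample per positive edge):
import torch.nn.functional as F

pos_loss = -F.logsigmoid((z_u * z_v).sum(dim=-1)).mean()    # Pull connected pairs together
neg_loss = -F.logsigmoid(-(z_u * z_vn).sum(dim=-1)).mean()  # Push negative pairs apart
loss = pos_loss + neg_loss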
Thanks. I think that makes sense. I am just trying to get my head around your library to implement edge-based batching. Can you guide me on where and how to implement this while maintaining your code design? I can submit a pull request with that eventually.
I think a simple example on how to implement the GraphSAGE unsupervised loss should be sufficient. Feel free to submit a PR :)
@rusty1s Thanks for the update! I ran into a Segmentation fault when running the example reddit.py:
Fatal Python error: Segmentation fault
Current thread 0x00007fe21326f740 (most recent call first):
File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_cluster/sampler.py", line 13 in neighbor_sampler
File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 154 in __produce__
File "/data/1/weiwen/miniconda3/envs/pyg/lib/python3.7/site-packages/torch_geometric/data/sampler.py", line 199 in __call__
File "reddit.py", line 68 in train
File "reddit.py", line 89 in <module>
Segmentation fault
Any idea how I can fix it? Thanks in advance!
Can you run the torch-cluster test suite to verify that everything works as expected?
The test output said that my compiler (g++ 4.8.5) was incompatible. I upgraded it and it works fine now. Thanks :)
Heyyy @rusty1s, I'm trying to make NNConv supported for NeighborSampler.
One quick question: how can I obtain edge_attr in each block? Should I add self-loops to the graph first and then use e_id to index them?
Exactly, self-loops need to exist in advance in order to use e_id. In addition, you need to set add_self_loops=True so that self-loops are guaranteed to be sampled.
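A minimal sketch of that pattern (the manual self-loop construction, the sampler arguments and the block attributes used here are assumptions and may need adjusting to your version):
import torch

# Append self-loops with (here) zero-valued attributes so that `e_id` can index
# `edge_attr` for every sampled edge, including the self-loops.
loop_index = torch.arange(data.num_nodes).unsqueeze(0).repeat(2, 1)
loop_attr = data.edge_attr.new_zeros(data.num_nodes, data.edge_attr.size(1))
data.edge_index = torch.cat([data.edge_index, loop_index], dim=1)
data.edge_attr = torch.cat([data.edge_attr, loop_attr], dim=0)

sampler = NeighborSampler(data, size=[1.0], num_hops=1, batch_size=256, add_self_loops=True)
for data_flow in sampler(data.train_mask):
    for block in data_flow:
        block_edge_attr = data.edge_attr[block.e_id]  # Attributes of the edges in this block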
Is there any smart way to do inference with the neighborhood sampler? I am using it for training on large graphs, but for inference I run out of memory very quickly.
I am thinking of something like sampling n (overlapping) regions from the graph and making sure that every node is sampled at least once. Is that currently possible?
Thanks!
@raphaelsulzer If you can fit all the neighbours of any single node in memory you can try this:
ns = NeighborSampler(data,
                     size=1.0,
                     num_hops=num_hops,
                     batch_size=1,
                     shuffle=False,
                     add_self_loops=True)
This will sample each node and its num_hops neighbourhood, so you can do a forward pass on each node one-by-one.
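Hypothetical usage of the sampler above (assuming a model with the same call signature as in the Reddit example; since shuffle=False, the embeddings come back in the original node order):
embeddings = []
for data_flow in ns():
    embeddings.append(model(data.x, data_flow))  # One forward pass per center node
embeddings = torch.cat(embeddings, dim=0)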
@duncster94 Thank you! That works!
Hi @rusty1s,
I am trying to train and classify only a fraction of the total number of classes in a continual learning setting. For example, for the Cora dataset with 7 classes:
import os.path as osp

import torch
import torch_geometric.transforms as transforms
from torch_geometric.datasets import Planetoid
from torch_geometric.loader import NeighborLoader

path = osp.join(osp.dirname(osp.realpath(__file__)), '..', 'data', 'Cora')
data = Planetoid(path, "Cora", transform=transforms.NormalizeFeatures())[0]
_, _, classes, train_mask, val_mask, test_mask = data
classes_in_task = [0, 1, 2]
conditions = torch.BoolTensor([l in classes_in_task for l in classes[1]])
train_mask = (train_mask[0], train_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in the **train** split
val_mask = (val_mask[0], val_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in the **validation** split
test_mask = (test_mask[0], test_mask[1] * conditions)  # Mask all nodes with classes not in [0,1,2] in the **test** split
train_task_nid = torch.nonzero(train_mask[1]).flatten()  # Select node ids with classes in [0,1,2]

train_loader = NeighborLoader(
    data,
    num_neighbors=[30] * 2,
    batch_size=2000,
    input_nodes=train_task_nid,
)
I use the train_loader to train the nodes with only specific classes in multiple batches. But training the network this way for only specific nodes isn't giving me good results. Is there any workaround for this?
Also, during testing, I want to test all the test nodes as a single batch. How do I set this in the NeighborLoader arguments?
What do you mean with "it does not give good results"? Is it due to your learning setup or due to NeighborLoader?
During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.
During evaluation, you can omit the use of NeighborLoader if you want to test on a single batch/full-batch.
If I omit the usage of NeighborLoader, how can I select the input nodes with only the classes [0,1,2] for testing? Is there a function for only selecting a particular set of nodes? Thanks for the reply.
After obtaining the model output, you apply masking to select only the node embeddings you are interested in:
out = model(data.x, data.edge_index)
out_train = out[data.train_mask]
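If you additionally want to keep only test nodes whose labels lie in classes_in_task (from the snippet above), a possible continuation could look like this (a sketch; torch.isin requires PyTorch >= 1.10):
class_mask = torch.isin(data.y, torch.tensor(classes_in_task))
out_test = out[data.test_mask & class_mask]   # Embeddings of test nodes with classes [0, 1, 2]
y_test = data.y[data.test_mask & class_mask]  # Corresponding labels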
Hi! I'm trying to train an unsupervised GraphSAGE model on a heterogeneous graph, and I have some difficulties computing the loss with the use of NeighborLoader. Thank you for the help.
Can you share some more details (at best in a new issue)? I'm happy to help!
@rusty1s how can I know which nodes are the center nodes in the sampled data obtained by torch_geometric.loader.NeighborLoader?
The center nodes will be placed first in the sampled output. As such, they are given via slicing based on the batch size (the number of center nodes), that is, batch.x[:batch_size]. We also apply this slicing in the linked example. Hope this clarifies your doubts!
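In a training loop this typically looks like the following sketch (loader and model are assumptions, e.g. a NeighborLoader as above and a model taking x and edge_index):
for batch in loader:
    out = model(batch.x, batch.edge_index)
    out = out[:batch.batch_size]    # Predictions for the center (seed) nodes only
    y = batch.y[:batch.batch_size]  # Labels of the center nodes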
Hello,
thanks for the awesome code!
I am trying to implement the negative sampling algorithm used here but using the new NeighborLoader class.
Is there already an implementation of this?
My understanding is that I need to replace
train_loader = NeighborSampler(data.edge_index, sizes=[10, 10], batch_size=256, shuffle=True, num_nodes=data.num_nodes)
by
kwargs = {'batch_size': 256}
train_loader = NeighborLoader(data, num_neighbors=[10, 10], shuffle=True, **kwargs)
Then, when I run
batch = next(iter(train_loader))
the output 'batch' looks exactly the same shape as the input "data", except that the edge_index is smaller and now I have a "batch_size" attribute too. It appears that 'data.x' has been permuted and data.edge_index has been subsampled.
Are the 256 sampled nodes given by batch.x[:batch_size] (as per your previous answer)? If so, what are the other nodes batch.x[batch_size:]? In particular, where are the positive and negative samples? Also, what is data.edge_index? Is it the induced subgraph of the center nodes, or does it also include the positive/negative samples?
There exists a LinkNeighborLoader class that also supports negative sampling via neg_sampling_ratio.
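A minimal sketch of how this could look (the parameter values here are assumptions, and LinkNeighborLoader requires a recent PyG version):
from torch_geometric.loader import LinkNeighborLoader

loader = LinkNeighborLoader(
    data,
    num_neighbors=[10, 10],
    edge_label_index=data.edge_index,  # Supervise on all existing edges
    neg_sampling_ratio=1.0,            # Sample one negative edge per positive edge
    batch_size=256,
    shuffle=True,
)

batch = next(iter(loader))
# batch.edge_label_index holds positive and sampled negative edges;
# batch.edge_label marks them with 1 (positive) and 0 (negative).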
Question: how should NeighborSampler(data, size=[1.0, 1.0], num_hops=2, batch_size=batch_size, shuffle=True, add_self_loops=True) be expressed when converted to NeighborLoader?
Hi Matthias,
I wrote my own dataset and dataloader, and I used an adjacency matrix instead of edge_index. When I tried to convert adj_matrix to edge_index, I got confused because I have multiple graphs (multiple samples, which may have different numbers of nodes) in one batch. I went over some of the examples and found most of them have batch_size 1. How should I prepare the edge_index in a mini-batch setting? I can easily use DenseSAGEConv, but I want to try other networks.
Thanks, Ming