pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License
20.98k stars 3.62k forks

GCNConv on Subgraphs #420

Closed yutaizhou closed 5 years ago

yutaizhou commented 5 years ago

❓ Questions & Help

Let's say I have a graph G of 1000 nodes and a subgraph of G, G', of 100 nodes. When I pass G' to a graph convolution layer, add_remaining_self_loops() may throw an error, since the edge_index still refers to the nodes by their original index numbers from G, not G'. Although G' only has 100 nodes, its edge indices inherited from the original graph may have values higher than 100.

The subgraph is created using networkx, and converted via utils.from_networkx(subgraph_nodes)

How should this issue be addressed for both per-case implementation, and for library modification?

rusty1s commented 5 years ago

I do not see why this should happen. There is no caching behavior in GCNConv in case cached=False.

Edit: So you have edge indices higher than 100 in a graph with 100 nodes, so this error is to be expected. Do you expect from_networkx to handle this, or what do you suggest? As far as I know, you can relabel node indices in networkx.
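For reference, one way to relabel a networkx subgraph to consecutive integers before conversion is nx.convert_node_labels_to_integers. A minimal sketch, with illustrative graph sizes and node choices:

```python
import networkx as nx

# Hypothetical example: a 1000-node graph and a 3-node subgraph of it.
G = nx.path_graph(1000)
sub = G.subgraph([500, 501, 502])  # nodes keep their original labels here

# Relabel the subgraph's nodes to 0..n-1 so edge indices stay in range;
# each node's original label is kept in the 'orig_id' node attribute.
sub = nx.convert_node_labels_to_integers(sub, label_attribute="orig_id")
print(sorted(sub.nodes()))  # → [0, 1, 2]
```

After this, utils.from_networkx(sub) produces an edge_index whose values are all within [0, num_nodes).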

yutaizhou commented 5 years ago

It shouldn't have to do with caching. I see it more as a subgraphing issue, since it only concerns the labeling of nodes. On second thought, it may be a bad idea to have from_networkx handle this, as the right way to relabel the subgraph's nodes is not exactly clear.

My specific use case involves a "total universe" G, if you will, and an expanding observable universe as a subgraph, G'. Graph and subgraph creation are done with networkx. G' slowly adds in more and more nodes and edges from G as more observation takes place. Expansion of the observable area involves graph convolutions, which take in a subset of the edge indices of G', not the whole set. (I apologize, I cannot go into details.) So each expansion involves the problem I mentioned above. See how the best way to relabel the edge indices is not exactly obvious?

I will have to play around with this problem a bit more. I will let you know whether from_networkx should be changed, or whether I am just using it inappropriately...

Thank you as always for your great work on this fantastic library!

rusty1s commented 5 years ago

Ok, so I wouldn't relabel nodes, due to the computational cost and higher implementation effort. It may be best to simply keep the number of nodes fixed and change edge_index over time. This way, you still do not aggregate messages from isolated nodes (you can ignore the added self-loops), and after convolution you can simply mask out invalid nodes. I still see no reason why add_remaining_self_loops would crash here. Can you show the error message?

yutaizhou commented 5 years ago

Would it be sufficient to just keep the number of nodes the same but grow edge_index over time? Would graph convolution automatically exclude the isolated nodes (in this sense, isolated means unobserved) from message passing? Sorry, I am not at my work desk now; I will get you the error message when I am back.

yutaizhou commented 5 years ago

data.x is of shape 102 x num_features, but some of the edge indices have values far higher than 102. The highest accepted value should be 101.

Below is my error message:

File "/home/yutaizhou/pytorch_geometric/torch_geometric/nn/conv/gcn_conv.py", line 90, in forward
    edge_index, norm = self.norm(edge_index, x.size(0), edge_weight, self.improved, x.dtype)
File "/home/yutaizhou/pytorch_geometric/torch_geometric/nn/conv/gcn_conv.py", line 76, in norm
    edge_index, edge_weight = add_remaining_self_loops(edge_index, edge_weight, fill_value, num_nodes)
File "/home/yutaizhou/pytorch_geometric/torch_geometric/utils/loop.py", line 116, in add_remaining_self_loops
    loop_weight[row[inv_mask]] = remaining_edge_weight
IndexError: index 839 is out of bounds for dimension 0 with size 102

rusty1s commented 5 years ago

Yes, data.x needs shape [1000, num_features]. As I said previously, you need some kind of mask vector that marks all valid nodes in your graph. You can then simply do the following:

x = torch.zeros(1000, num_features)  # full-size feature tensor
x[mask] = sub_x                      # scatter subgraph features into it
x = conv(x, sub_edge_index)          # only masked nodes send/receive messages
sub_x = x[mask]                      # gather the updated subgraph features
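The pattern above can be fleshed out into a self-contained sketch. Here a simple sum-aggregation function stands in for a GCNConv-style layer so the example runs without torch_geometric; the sizes, node choices, and edges are illustrative:

```python
import torch

num_nodes, num_features = 10, 4  # toy sizes; the thread's example uses 1000 nodes

# Subgraph: nodes {2, 5, 7} of the full graph, edges in *original* indices.
mask = torch.zeros(num_nodes, dtype=torch.bool)
mask[[2, 5, 7]] = True
sub_edge_index = torch.tensor([[2, 5],
                               [5, 7]])  # edges 2->5 and 5->7
sub_x = torch.ones(3, num_features)

# Stand-in for a graph conv layer: plain sum aggregation over edges.
def conv(x, edge_index):
    out = torch.zeros_like(x)
    src, dst = edge_index
    out.index_add_(0, dst, x[src])  # each target node sums its incoming features
    return out

x = torch.zeros(num_nodes, num_features)
x[mask] = sub_x              # scatter subgraph features into the full tensor
x = conv(x, sub_edge_index)  # nodes outside the subgraph stay at zero
sub_x = x[mask]              # gather the updated subgraph features back
```

Because edge_index never references nodes outside the mask, unobserved nodes receive no messages and their rows stay untouched.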
yutaizhou commented 5 years ago

Ah gotcha. I thought this may reveal some insight on a design issue, but I think I was simply using it in an inappropriate way. Thank you very much for the clarification!

yutaizhou commented 5 years ago

Your recommendation worked just fine. Instead of masking the node feature tensor, I passed the entire tensor to conv() but reconstructed edge_index to include only the edges between nodes of interest, using networkx's has_edge() method. This way I can refer to any node via edge_index (index out of bounds is not an issue since I am passing in the whole feature tensor), but message passing only collects information from the nodes referenced by edge_index. edge_index grows as the observable area of the total universe grows.

Do you think it would be a good idea to include networkx's neighbors() and has_edge() methods in the Data class? Or do you think they are better kept in networkx only?

Thank you!

rusty1s commented 5 years ago

Alternatively, we could provide a filter_edge_index(edge_index, mask) function which takes in the initial edge_index and a node mask. The resulting edge_index contains only edges from/to valid nodes i with mask[i] = 1. WDYT?
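A minimal sketch of what such a helper could look like in PyTorch; this is only an illustration of the proposal, not the eventual PyG API:

```python
import torch

def filter_edge_index(edge_index, mask):
    # Keep only edges whose endpoints are both valid under the node mask.
    row, col = edge_index
    keep = mask[row] & mask[col]
    return edge_index[:, keep]

edge_index = torch.tensor([[0, 1, 2, 3],
                           [1, 2, 3, 0]])
mask = torch.tensor([True, True, True, False])
filtered = filter_edge_index(edge_index, mask)  # keeps only 0->1 and 1->2
```

Node indices are untouched, so the filtered edge_index can be used directly with the full feature tensor, matching the masking workflow discussed above.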

yutaizhou commented 5 years ago

I don't know enough about the field of geometric deep learning to have a good sense of whether such a function would be heavily used, but I think that's an excellent idea! I would certainly use it. networkx's subgraph() works similarly: for a graph G, it retains only the vertices in the list passed as an argument, as well as the edges between them. So it takes not a mask detailing which indices to keep, but rather a list of nodes to keep. I am personally more partial to that interface than to the mask you mentioned, i.e. filter_edge_index(edge_index, nodes_to_keep). What do you think?

Additionally, I don't think such a method is an alternative to networkx's neighbors() or has_edge(), as it accomplishes a completely different goal. Do you think those two methods are worth implementing for PyG? Or are they better off kept in networkx only?

rusty1s commented 5 years ago

I added a subgraph method to master.

I'm personally not a big fan of adding neighbors() and has_edge() functionality to PyG, as PyG's edge_index data layout does not fit well with those methods, and I still do not exactly see their use cases. In the end, you can simply iterate over edge_index and manually adjust it.

yutaizhou commented 5 years ago

Wow that was super fast! 😆 I will play with that method a bit later and let you know if I find anything.

Use-case wise, I think those two functions can be quite useful, especially neighbors(). But I agree with your point on the layout of edge_index. If I find a strong use case, I will try to convince you otherwise.

Thank you as always for your fantastic work!

bourne-3 commented 3 years ago

Alternatively, we could provide a filter_edge_index(edge_index, mask) function, which takes in the initial edge_index and a node mask. The resulting edge_index does only contain edges from/to valid nodes i with mask[i] = 1. WDYT?

Thanks for providing this great method. After I get the edge_index with the help of subgraph(), I intend to call conv():


whole_nodes = graph.x  # whole node feature tensor, without masking
conv(whole_nodes, after_masked_edge_index)

Is this right? I only want to update certain nodes in the graph.

rusty1s commented 3 years ago

The conv layer will change the representations of all nodes, so you may want to reset the representations of the nodes outside the given subgraph afterwards:

x = graph.x
new_x = conv(...)              # updates the representations of all nodes
x[node_idx] = new_x[node_idx]  # keep the new values only for subgraph nodes
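Spelled out as a runnable sketch, with a trivial stand-in for the conv output and illustrative sizes:

```python
import torch

num_nodes, num_features = 6, 3
x = torch.arange(num_nodes * num_features, dtype=torch.float).view(num_nodes, num_features)
node_idx = torch.tensor([1, 3])  # nodes inside the subgraph

# Stand-in for conv(...): any layer returning new per-node representations.
new_x = x + 100.0

out = x.clone()
out[node_idx] = new_x[node_idx]  # only subgraph nodes take the new values
```

Cloning first keeps graph.x itself untouched; nodes outside node_idx retain their old representations.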