
StellarGraph - Machine Learning on Graphs
https://stellargraph.readthedocs.io/
Apache License 2.0

Use multiple graphs in mini batch generator #1275

Open davidxujiayang opened 4 years ago

davidxujiayang commented 4 years ago

Hi,

Is there a way to use the ClusterNodeGenerator on a list of graphs?

Thanks, -David

huonw commented 4 years ago

Hi @davidxujiayang, there isn't at the moment. A work-around for now would be to concatenate the subgraphs into one huge graph. That is, instead of having many StellarGraph objects, have one. One risk is that training may be inefficient, because many clusters will contain nodes from several graphs, and few edges will be considered.
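A minimal sketch of this concatenation work-around, assuming each graph is given as a pair of pandas DataFrames (nodes and edges) as StellarGraph's constructor accepts; the `combine_graphs` helper and the `g{i}-` ID prefix are hypothetical, just to keep node IDs from different graphs distinct:

```python
import pandas as pd

def combine_graphs(graphs):
    """Concatenate several (nodes, edges) DataFrame pairs into one pair,
    prefixing IDs so that nodes from different graphs stay distinct."""
    all_nodes, all_edges = [], []
    for i, (nodes, edges) in enumerate(graphs):
        prefix = f"g{i}-"
        all_nodes.append(nodes.rename(index=lambda n: prefix + str(n)))
        all_edges.append(
            edges.assign(
                source=prefix + edges["source"].astype(str),
                target=prefix + edges["target"].astype(str),
            )
        )
    return pd.concat(all_nodes), pd.concat(all_edges, ignore_index=True)

# two tiny example graphs whose node IDs would otherwise collide
g0 = (pd.DataFrame({"x": [0.0, 1.0]}, index=[0, 1]),
      pd.DataFrame({"source": [0], "target": [1]}))
g1 = (pd.DataFrame({"x": [2.0, 3.0]}, index=[0, 1]),
      pd.DataFrame({"source": [0], "target": [1]}))

nodes, edges = combine_graphs([g0, g1])
# the combined frames could then be passed to StellarGraph(nodes, edges=edges)
```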

Could you say a bit about what sort of graphs you're working with and the problem you're solving? Maybe there's an alternative approach we can use easily.

Thanks for using StellarGraph and filing an issue!

davidxujiayang commented 4 years ago

Hi Huon,

Thanks for the reply! My current work-around is that I modified the generator a bit so that the output feature has shape (n_case, n_nodeid, n_feature), where n_case is the number of training graphs, instead of the current (1, n_nodeid, n_feature). One drawback I've realised is that in each batch the same cluster is selected for all cases. Do you think there are other problems with it?
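The batch shape described above can be illustrated with a small NumPy sketch, assuming (as in the drawback mentioned) that all graphs share the same node ordering and the same cluster of node indices is used for every case:

```python
import numpy as np

n_case, n_nodes, n_feature = 3, 10, 2
# per-graph feature matrices, all with the same node ordering
features = [np.random.rand(n_nodes, n_feature) for _ in range(n_case)]

cluster = [1, 4, 7]  # the same cluster of node indices is used for every case

# stack into (n_case, n_nodeid, n_feature) instead of (1, n_nodeid, n_feature)
batch = np.stack([f[cluster] for f in features])
assert batch.shape == (n_case, len(cluster), n_feature)
```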

The problem I'm trying to address is predicting the pressure distribution on different CFD meshes. The input features are the x, y coordinates as well as a distance function that gives the distance between each point and the wall; the target I want to predict is the pressure (a scalar value at each point). I'm still prototyping with the clustered GCN, so any suggestions would be much appreciated!

Thanks,


huonw commented 4 years ago

From your description, that sounds like a very reasonable work-around. If you're comfortable sharing the code, we might even be able to put it into the library.

One drawback I realize is that this way is that in each batch the cluster selected for all cases are the same. Do you think there are other problems with it?

I'm not sure how well it generalises to graphs with different sizes, where the clusters won't all have the same number of elements. Are your graphs all the same size?

The problem I'm trying to address is to predict pressure distribution on different CFD meshes. So the input features are x,y coordinates as well as a distance function that defines the distance between each point and the wall, the target I want to predict is the pressure (a scalar value on each point). I'm still prototyping with the clustered GCN, if you have any suggestions on this it would be very appreciated!

That sounds very interesting!

Another approach that is inductive and so generalises to unseen graphs would be using GraphSAGE: https://github.com/stellargraph/stellargraph/blob/master/demos/node-classification/graphsage/graphsage-cora-node-classification-example.ipynb

This would definitely require putting everything into a single graph/StellarGraph. However, unlike Cluster-GCN, this shouldn't suffer from any sparsity problems: for each training node, a sample is taken around it directly, so it doesn't matter if there are many disconnected components in the overall graph.

davidxujiayang commented 4 years ago

list_mini_batch_node_generators.zip

Hi Huon,

I'm more than happy to share my code; please see attached. I took a slightly different approach from what I described above, because it seems that the GraphConvolution layer requires the input to be of size (1, ...), and I'm not sure whether that's required for performance reasons or is just an implementation limitation. The attached code mainly modifies the __getitem__ part, which now randomises both the graph list and the clusters.

The code should be able to replace the original mini_batch_node_generator.py, as it wraps a graph into a list if it isn't one already, but currently I'm only using it in my project directory without modifying the original file. The code itself cannot generalise to graphs of different sizes. I guess the generalisation could be done in the graph-generation step, i.e. generate all graphs using the same list of node IDs (or generate them separately and map them to their union). That way, some nodes would not be used by any edges in the smaller graphs. I'm not sure if that's the correct way to do it. Currently my graphs are all the same size.

One possible bug (which could be entirely due to my misunderstanding or misuse) in both the original mini_batch_node_generator and my code is that on_epoch_end does not seem to be called correctly; I tested this by adding a print statement. The current work-around in my code is to copy the body of on_epoch_end into __len__, which is called at the end of every epoch. I don't know if it's safe to do so.
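For illustration (this is not the actual StellarGraph or attached code), a framework-free sketch of what a Keras-style Sequence's on_epoch_end is meant to do here: reshuffle which (graph, cluster) pair each batch index maps to. The `ClusterSequence` class and its attributes are hypothetical stand-ins:

```python
import random

class ClusterSequence:
    """Minimal sketch of a Keras-style Sequence over (graph, cluster) pairs.

    Each epoch visits every cluster of every graph, in a freshly
    shuffled order set by on_epoch_end.
    """
    def __init__(self, n_graphs, n_clusters, seed=0):
        self.pairs = [(g, c) for g in range(n_graphs) for c in range(n_clusters)]
        self.rng = random.Random(seed)
        self.on_epoch_end()

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        return self.order[idx]

    def on_epoch_end(self):
        # reshuffle the mapping from batch index to (graph, cluster) pair
        self.order = self.rng.sample(self.pairs, len(self.pairs))

seq = ClusterSequence(n_graphs=2, n_clusters=3)
epoch1 = [seq[i] for i in range(len(seq))]
seq.on_epoch_end()
epoch2 = [seq[i] for i in range(len(seq))]
assert sorted(epoch1) == sorted(epoch2) == seq.pairs  # same pairs, new order
```

Copying the shuffle into `__len__` as described would achieve the same effect as long as Keras calls `__len__` once per epoch, but it relies on that undocumented call pattern.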

Thanks for your suggestion about using GraphSAGE. I chose Cluster-GCN for its smaller memory cost: some of my target CFD meshes have around 200M nodes, and after some literature review it seems that Cluster-GCN is the only feasible option on my limited computing resources. But I'm very new to graph neural networks, so I could be totally wrong on this; please correct me if so.

Thanks, -David


huonw commented 4 years ago

I'm more than happy to share my code, please see attached

Great!

Unfortunately attachments don't come through the GitHub email -> comment conversion; could you post it some other way? For instance, if it's a single file (or only a few), one could put it in a gist. (If it's multiple files, one option would be to create a zip file and attach it to a comment via the GitHub UI.) We have no preference for a particular method, so whatever works for you :smile:

I took a slightly different approach from what I described above, because it seems that the graphconvolution layer specifies the input to be of size (1, ...) and I'm not sure if that's required for performance reasons or just an implementation limitation

It's an implementation limitation. Coincidentally, it has actually been lifted just today, in the latest develop version (via #1205). If you wish to use this new GraphConvolution layer from the library, you can install develop either directly with pip install git+https://github.com/stellargraph/stellargraph.git or by cloning the repo. Alternatively, you can just copy the new version out of the source, or wait a few days, because we're planning to do a release this week.

The code itself cannot generalize for graphs with different sizes. I guess the generalization can be done in the graph generation step, i.e. let all graphs to be generated using the same list of node ids, (or be generated separately and mapped to their union). This way some of the nodes will not be used by any edges for smaller graphs. I'm not sure if it is the correct way to do it.

Hm, interesting. That's potentially a good route. I guess we'd have to sit down and actually implement it to be sure.

Currently my graphs are all the same size.

Makes sense. Just so I understand: are you likely to be working with lists of graphs of different sizes in the future?

Thanks for your suggestion about using GraphSAGE. My choice of cluster-GCN is due to its smaller memory cost. Some of my target CFD meshes have around 200M nodes, and after some literature review it seems that cluster GCN is the only feasible option on my limited computing resources. But I'm very new to graph neural networks, and I could be totally wrong on this. Please kindly correct me if so.

Yeah, you're correct that GraphSAGE results in a multiplicative increase in the number of features that have to be stored in memory. However, the explosion is alleviated in StellarGraph because it is a mini-batch method similar to Cluster-GCN: it batches up the requested "head" nodes, and only samples the neighbours (and fetches the features) for a small number of batches at a time.
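A back-of-envelope way to see the multiplicative increase: with per-layer neighbour sample sizes s1, s2, ..., each head node expands to 1 + s1 + s1*s2 + ... sampled nodes, so only batch_size times that many feature rows need to be in memory at once. The helper name below is hypothetical:

```python
def sampled_nodes_per_head(num_samples):
    """Nodes fetched per head node for GraphSAGE-style sampling:
    1 (the head) + s1 + s1*s2 + ... for per-layer sample sizes [s1, s2, ...]."""
    total, layer = 1, 1
    for s in num_samples:
        layer *= s
        total += layer
    return total

# e.g. two layers sampling 10 then 5 neighbours
per_head = sampled_nodes_per_head([10, 5])  # 1 + 10 + 50 = 61
batch_size = 64
print(batch_size * per_head)  # feature rows held for one batch: 3904
```

The point is that this cost depends on the batch and sample sizes, not on the 200M nodes in the full graph.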

davidxujiayang commented 4 years ago

Hi Huon,

Thanks for your reply. I noticed that the email did not carry the attachments correctly, so I edited my answer directly on the GitHub issue page; you can find a link to the code there now. I also modified the comment a bit to mention a problem with calling the on_epoch_end method; I'm not sure if that's a real problem or just a mistake in my implementation.

In the future I'm likely to work on graphs of different sizes, though they may not differ much from each other, maybe a variation of 2~3%. But I assume even this will lead to many difficulties in training, although I can't foresee the details.

Actually, after some more literature review today, I realised that Cluster-GCN is more of a training treatment layered on top of other modifications to the convolution operation itself. I think what you described actually has all the essential elements of Cluster-GCN. Do you think there is any substantial difference between them? Have you considered making clustering an option for all methods, if it isn't already?

Thanks again for the detailed reply and all the work!


huonw commented 4 years ago

I noticed that the email does not carry the attachments correctly and modified my answer directly on the github issue page so you can find a link to the code now.

Thanks, it looks like a reasonable approach given the constraints you have.

I also modified the comment a bit, where I mentioned a problem about calling the on_epoch_end method, I'm not sure if that's a real problem or just my implementation mistake.

Nice catch! It looks like it's a TensorFlow problem that may be fixed in TF 2.2 (based on the mention of it being fixed in tf-nightly a while ago): https://github.com/tensorflow/tensorflow/issues/35911

I've opened https://github.com/stellargraph/stellargraph/issues/1356 to make sure that StellarGraph handles it properly and validates it is fixed.

in the future I'm likely to work on graphs with different sizes, they may not be too different from each other though. Maybe a variation of 2~3%. But even this will lead to many difficulties in the training I assume, although I can't foresee any of the details.

The problem I see is that the approach requires every graph to have batches of exactly the same shape, which means the clusters have to have the same sizes. If the graphs have different numbers of nodes, some clusters will have more nodes; for instance, with 2 clusters, a graph of size 6 has two clusters of size 3, while a graph of size 7 has a cluster of size 3 and one of size 4. These can't be batched together directly because the tensor shapes don't line up.

One work-around for this would be to ensure all the graphs have the same size by padding the smaller ones with isolated nodes, to match the largest graph.
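A sketch of that padding idea, working only on lists of node IDs; the `pad_node_lists` helper and the `pad-` IDs are hypothetical, and the dummy nodes would carry zero features and no edges:

```python
def pad_node_lists(node_lists):
    """Pad each graph's node-ID list with isolated dummy nodes so all
    graphs have as many nodes as the largest one."""
    target = max(len(nodes) for nodes in node_lists)
    return [
        nodes + [f"pad-{g}-{i}" for i in range(target - len(nodes))]
        for g, nodes in enumerate(node_lists)
    ]

graphs = [["a", "b", "c"], ["d", "e"]]
padded = pad_node_lists(graphs)
assert all(len(nodes) == 3 for nodes in padded)
# padded[1] == ['d', 'e', 'pad-1-0']
```

With equal node counts, equal-sized clusters (and hence equal batch shapes) can then be formed for every graph.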

Actually after some more literature review today, I realized that Cluster GCN is more like a training treatment on top of other modifications to the convolution operation itself. I think what you described actually has all the essences of Cluster GCN. Do you think there is any substantial difference between them? Have you considered making clustering an option for all methods, if it is not yet?

That's a great point. I'm not sure it's been considered, so I've opened another issue (https://github.com/stellargraph/stellargraph/issues/1355) to track this specifically.

One risk is that training may be inefficient, because many clusters will contain nodes from several graphs, and few edges will be considered.

Ah, going back to this. I forgot that there's a way to improve this, just FYI. The ClusterNodeGenerator(..., clusters=...) argument supports passing in the list of clusters. https://stellargraph.readthedocs.io/en/stable/api.html#stellargraph.mapper.ClusterNodeGenerator

clusters (int or list) – If int then it indicates the number of clusters (default is 1 that is the given graph). If clusters is greater than 1, then nodes are uniformly at random assigned to a cluster. If list, then it should be a list of lists of node IDs such that each list corresponds to a cluster of nodes in G. The clusters should be non-overlapping.

This would allow creating a single huge graph with all of the meshes as disconnected subgraphs, and then having each cluster contain nodes from only a single graph. For example:

nodes_per_mesh = ... # a list of lists of nodes for each graph
num_clusters = ...

clusters = [
    nodes[idx::num_clusters] # every num_clusters'th node from this mesh
    for nodes in nodes_per_mesh
    for idx in range(num_clusters)
]

assert len(clusters) == len(nodes_per_mesh) * num_clusters

This still has problems for q > 1, since each group of q clusters is unlikely to have many from the same graph, so there are no edges between clusters in the group, only edges within each cluster.
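To make the snippet above concrete, here it is with toy stand-ins for the elided `nodes_per_mesh` and `num_clusters` (the `a*`/`b*` node IDs are hypothetical), showing that the resulting clusters are non-overlapping and each draws from a single mesh:

```python
# toy stand-ins for the elided values in the snippet above
nodes_per_mesh = [["a0", "a1", "a2", "a3"], ["b0", "b1", "b2", "b3"]]
num_clusters = 2

clusters = [
    nodes[idx::num_clusters]  # every num_clusters'th node from this mesh
    for nodes in nodes_per_mesh
    for idx in range(num_clusters)
]
assert len(clusters) == len(nodes_per_mesh) * num_clusters
# clusters == [['a0', 'a2'], ['a1', 'a3'], ['b0', 'b2'], ['b1', 'b3']]
# non-overlapping: the union of the clusters covers every node exactly once
assert len(set().union(*map(set, clusters))) == 8
```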

huonw commented 4 years ago

Hey @davidxujiayang, you may have noticed my comment on #1355: APPNP, GAT, and GCN now support taking a ClusterNodeGenerator in addition to a FullBatchNodeGenerator, and so can be trained with clusters (#1531, #1585).