Graph Classification & Batchwise Training

rushilanirudh commented 7 years ago

Hi, I have two questions:

1) I would like to repurpose this for graph classification (where I have a single label per graph). I see there is an option to choose featureless=False when defining a new GraphConvolution layer. However, the loss is still computed for each node, and I was wondering how I should change your code.
2) In the context of graph classification, how should I modify your train.py for batch-wise training?

Thanks for putting this together!

MortZx commented 6 years ago

Hi @tkipf, first of all I find this work really interesting. My apologies in advance for asking yet another question on how to adapt your code for graph classification. I also understand that you may not be that familiar with the code anymore and that my question might be too specific for you to help with.

I am relatively new to machine learning and I'm trying to implement your code to classify molecules (graphs of different sizes) into 2 classes (toxic or non-toxic). I have been working on this project for a few weeks now taking suggestions from the comments above but regardless of what I tried the model terminates early after 12 epochs with an accuracy no higher than 60%. I believe the way I am pre-processing or formatting my data is where I am going wrong as I'm trying to use the model 'off the shelf'.

I have 100 graphs (different number of nodes per graph), with 50 graphs for each of my 2 classes (once my model is working I plan on using the full dataset of 1000 graphs). I'm trying to implement the 'hacky' version you mentioned at the beginning of this post by introducing a super node which is connected to all other nodes in a graph and initialized with a feature vector of 0s. I formatted my data such that each node in a graph has the same label as the graph (so if a graph is meant to be part of class 1, all nodes in that graph have the label of class 1).

I'm now unsure how I should separate my data into training features with labels, training features without labels, and the test set (obtaining x, tx and allx in the code). I split my data into a 60% training set and a 40% test set, with an even spread over both classes. In the training set, half of the super nodes are not given a label so that I can still train the model in a semi-supervised fashion. Or am I going about this the wrong way, and should I edit the code to work for supervised learning?

Even though I am not training the model batch-wise I assume I still have to use the block-diagonal adjacency matrix as described in the image you posted before? I also modified the way the accuracy is calculated by comparing the label prediction of the unlabeled super nodes with the label they should be predicted as.

Again sorry for the long post ... and thanks in advance for any help!

tkipf commented 6 years ago

Sounds like a reasonable set-up. A dataset of 50 labeled graphs is extremely small, so even if you did everything correctly I wouldn’t expect much. It’s probably better to work with datasets on the order of several thousands of samples or more.

Your final adjacency matrix should be block-diagonal if the nodes are ordered according to which graph they belong to. If you don’t order them in any particular way, the only important property is that there are no connections between samples, i.e. every connected component in the batch adjacency matrix corresponds to one graph.
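(Editor's sketch, not code from the repository.) The block-diagonal batching described above could look roughly like this in plain NumPy. The helper name `batch_adjacency` and the `graph_id` vector are assumptions introduced for illustration; for real datasets a sparse representation would probably be preferable to the dense array used here.

```python
import numpy as np

def batch_adjacency(adjs):
    """Stack per-graph adjacency matrices into one block-diagonal matrix.

    Hypothetical helper: also returns, for every node in the batch,
    the index of the graph it belongs to.
    """
    sizes = [a.shape[0] for a in adjs]
    total = sum(sizes)
    batch = np.zeros((total, total), dtype=float)
    graph_id = np.zeros(total, dtype=int)
    offset = 0
    for g, a in enumerate(adjs):
        n = a.shape[0]
        # Each graph occupies its own diagonal block; all off-diagonal
        # blocks stay zero, so there are no edges between graphs.
        batch[offset:offset + n, offset:offset + n] = a
        graph_id[offset:offset + n] = g
        offset += n
    return batch, graph_id

# Two toy graphs: a 3-node path and a single 2-node edge.
a1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
a2 = np.array([[0, 1], [1, 0]], dtype=float)
A, graph_id = batch_adjacency([a1, a2])
```

Here `A` is a 5x5 matrix whose top-right and bottom-left blocks are all zeros, which is exactly the "no connections between samples" property.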

In terms of labeling: please don’t try to follow the data structure of these .tx files etc. (this is from a different paper and not a particularly good way to structure the data). Just write your own data loader and put the data in any reasonable format (e.g. csv files).

You only need to put a class label on this one global node per graph and otherwise keep the others unlabeled (Just vectors of 0s will do). Make sure to only put a loss on the labeled global nodes.
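(Editor's sketch, not code from the repository.) One way to read this advice: append a global node to each graph, give only that node a label, and mask the loss so the other nodes contribute nothing. The function names `add_global_node` and `masked_cross_entropy` are hypothetical, and the loss is written in NumPy purely to show the arithmetic.

```python
import numpy as np

def add_global_node(adj, feats):
    """Append one global node connected to all other nodes; its features are zeros."""
    n = adj.shape[0]
    adj_g = np.zeros((n + 1, n + 1), dtype=float)
    adj_g[:n, :n] = adj
    adj_g[n, :n] = 1.0  # global node -> every original node
    adj_g[:n, n] = 1.0  # every original node -> global node
    feats_g = np.vstack([feats, np.zeros((1, feats.shape[1]))])
    return adj_g, feats_g

def masked_cross_entropy(logits, labels_onehot, mask):
    """Mean softmax cross-entropy over the rows where mask is True."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_node = -(labels_onehot * log_probs).sum(axis=1)
    mask = mask.astype(float)
    return (per_node * mask).sum() / mask.sum()

# Toy graph: 3-node path with 2 features per node, graph label = class 1.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = np.ones((3, 2))
adj_g, feats_g = add_global_node(adj, feats)

labels = np.zeros((4, 2))
labels[3, 1] = 1.0                              # only the global node is labeled
mask = np.array([False, False, False, True])    # loss only on the global node
logits = np.zeros((4, 2))                       # stand-in for the model output
loss = masked_cross_entropy(logits, labels, mask)
```

With uniform logits the masked loss equals log(2), and changing the logits of the three unlabeled nodes leaves it unchanged.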

Hope this helps

TZWwww commented 5 years ago

Hi @tkipf, my task involves a dataset {(Ai, Xi, yi)}, where Ai is an adjacency matrix, Xi holds the input features over some nodes, and yi is the label of the graph (Ai, Xi). That is, Ai and Aj may differ in nodes, edges, and dimensions. Take a web graph for example: Xi may represent a group of users in the U.S., while Xj represents a group of users in the U.K.; yi is the anomaly-detection label for a single graph, and all the data come from the same web. I wonder whether the GCN you propose works here. In your case, there is only one graph for the whole dataset. I read the comments above and think this might work. However, to my understanding, the parameters in a GCN should fit the structural information of one specific graph. So how could it work if the graph varies with each data point? Thanks for explaining.

ahmedmazariML commented 5 years ago

Hi @TZWwww ,

Yes, you can have a different graph structure (different adjacency matrix, variable number of nodes) for each example. But the dimensionality of your node features should be the same.

For practical purposes: 1) Zero-pad the adjacency matrices to the same dimension, then apply a mask on each graph. 2) For the feature dimension, it's preferable to bring all features to the same dimension (even though you could use a mask and zero padding as in 1)). You can, for instance, apply PCA and control the number of dimensions to keep.
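(Editor's sketch.) The zero-padding-plus-mask idea in point 1) could be written as follows. The helpers `pad_batch` and `masked_mean` are hypothetical names, and the masked mean is just one possible way to use the mask when reducing node outputs to one vector per graph.

```python
import numpy as np

def pad_batch(adjs, feats):
    """Zero-pad graphs to a common size m and return a node mask.

    Returns A of shape [B, m, m], X of shape [B, m, d] and a boolean
    mask of shape [B, m] that is True for real (non-padding) nodes.
    """
    B = len(adjs)
    m = max(a.shape[0] for a in adjs)
    d = feats[0].shape[1]
    A = np.zeros((B, m, m))
    X = np.zeros((B, m, d))
    mask = np.zeros((B, m), dtype=bool)
    for i, (a, x) in enumerate(zip(adjs, feats)):
        n = a.shape[0]
        A[i, :n, :n] = a
        X[i, :n] = x
        mask[i, :n] = True
    return A, X, mask

def masked_mean(H, mask):
    """Average node vectors per graph, ignoring padded rows."""
    w = mask[..., None].astype(float)
    return (H * w).sum(axis=1) / w.sum(axis=1)

a1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
a2 = np.array([[0, 1], [1, 0]], dtype=float)
x1 = np.ones((3, 4))
x2 = 2.0 * np.ones((2, 4))

A, X, mask = pad_batch([a1, a2], [x1, x2])
pooled = masked_mean(X, mask)  # one vector per graph
```

Because padded rows are excluded by the mask, the second graph's mean is computed over its 2 real nodes only, not over 3.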

Hope it helps

TZWwww commented 5 years ago

Thanks very much @ahmedmazariML for your response 👍, it helps a lot. Sure, the feature dimension should be the same. I still have two questions: 1) Can I implement batch-wise training by stacking the adjacency matrices into a tensor of dimension [B, m, m], where B is the batch size and m is the maximum number of nodes in that batch (some columns could be padding)? 2) Why would this work? Since the parameters in a GCN should contain the information of a unique graph, do varied graphs lead to non-convergence?

Thanks for your patience. hh

ahmedmazariML commented 5 years ago

@TZWwww

1) Yes

2) Which parameters? What do you mean by non-convergence?

TZWwww commented 5 years ago

Thanks for explaining again 👍 By parameters I mean the learnable parameters in each layer of the GCN. By non-convergence I mean that the GCN is meant to solve problems with a unique global adjacency matrix, so it might fail to converge if the adjacency matrix varies.

athirapvi commented 5 years ago

How can we plot ROC curves for GCN that are capable of classifying graphs? Thanks in advance.

tkipf commented 5 years ago

Maybe this helps: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
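(Editor's sketch.) A minimal use of the linked scikit-learn function for binary graph classification might look like this. The label and score arrays are made-up toy values standing in for the true graph labels and the model's predicted probability of the positive class.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical per-graph values: true binary labels and the predicted
# probability of class 1 (e.g. the softmax output for that class).
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# False-positive rate and true-positive rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve as a single summary number.
auc = roc_auc_score(y_true, y_score)
```

The `fpr` and `tpr` arrays can then be handed to any plotting library to draw the curve. Note that the scores should be continuous probabilities or logits, not hard 0/1 predictions.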

teenusjohn commented 5 years ago

Can we use centrality as node attributes or will the GCN automatically learn it?

tkipf commented 5 years ago

Yes, you can certainly use additional structure-dependent node features. Even adding node degree as a node feature can help.
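(Editor's sketch.) Adding node degree as an extra feature column is a one-liner once the adjacency matrix is available. The helper name `append_degree` is hypothetical, and the example assumes a dense, unweighted adjacency matrix without self-loops.

```python
import numpy as np

def append_degree(adj, feats):
    """Append each node's degree as one additional feature column."""
    degree = adj.sum(axis=1, keepdims=True)
    return np.hstack([feats, degree])

# 3-node path graph: the middle node has degree 2, the ends degree 1.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
feats = np.ones((3, 2))
feats_with_degree = append_degree(adj, feats)
```

Other centrality measures could be appended in the same way, as long as every graph in the dataset ends up with the same number of feature columns.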

teenusjohn commented 5 years ago

Thank you so much.

teenusjohn commented 5 years ago

Dear kipf, can I expect an accuracy of 90 percent or above using this? I took almost 2500 samples and got 82 percent accuracy. I am now planning to increase my dataset to 5000 samples to improve the accuracy. Will that help?

IMSUSHI commented 5 years ago

Hi @tkipf , I took about 200 different size graphs for graph-level classification and got a good result. Now I'm wondering how to use the trained model to predict the graphs without labels. Can I use the model to predict a single graph?

jzmclover commented 4 years ago

Hi @tkipf, I have a dataset consisting of 10k small networks. Each network has 19 nodes, and all of them share the same adjacency matrix. That is, only the node features differ among the networks. My task is to classify them into two categories. Are the two methods ("global nodes" and "global pooling") suitable in this case? Thanks for your reply!

ltbd78 commented 3 years ago

This might be a dumb question, but how do you incorporate the loss function for the output pooling matrix? Based on the image, are the outputs one-hot encodings or actual class labels?
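(Editor's sketch, not an answer from the author.) Assuming the pooling matrix is a [num_graphs, num_nodes] assignment matrix, multiplying it with the node-level outputs gives one logit vector per graph, and the loss is an ordinary softmax cross-entropy against the graph labels. The mean normalization and all names below are assumptions for illustration. The example also shows that integer class indices and one-hot labels give the same loss.

```python
import numpy as np

def pooling_matrix(graph_id, num_graphs):
    """P[g, i] = 1/|V_g| if node i belongs to graph g, else 0 (mean pooling)."""
    num_nodes = graph_id.shape[0]
    P = np.zeros((num_graphs, num_nodes))
    P[graph_id, np.arange(num_nodes)] = 1.0
    return P / P.sum(axis=1, keepdims=True)

def log_softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

# Batch of two graphs with 3 and 2 nodes; H stands in for the
# node-level GCN output with one column per class.
graph_id = np.array([0, 0, 0, 1, 1])
H = np.random.default_rng(0).normal(size=(5, 2))

P = pooling_matrix(graph_id, num_graphs=2)
graph_logits = P @ H                      # shape [num_graphs, num_classes]

labels = np.array([0, 1])                 # integer class index per graph
one_hot = np.eye(2)[labels]               # the same labels, one-hot encoded

log_p = log_softmax(graph_logits)
loss_from_indices = -log_p[np.arange(2), labels].mean()
loss_from_one_hot = -(one_hot * log_p).sum(axis=1).mean()
```

Both loss values are identical, so the choice between one-hot vectors and class indices is only a matter of which form the loss function in your framework expects.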

LiuYonyi commented 9 months ago

Hi, your GCN batch operations are great! They are used here for graph classification, but I now want to use them for node classification. Do you think this is feasible? I have a series of graphs that all share the same feature attributes.

tkipf commented 9 months ago

Thanks for your question! You can certainly use the same form of graph batching for node-level prediction tasks as well. You just need to skip the "output pooling matrix" step and directly apply your node classifier on top of the output of GCN(A, X). The block-diagonal structure of the adjacency matrix ensures that no information flows between different graphs.
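(Editor's sketch, not code from the repository.) The claim that the block-diagonal structure keeps graphs separate can be checked with a small two-layer GCN written in NumPy. The function names, the random weights, and the toy graphs are all assumptions. The output is one logit vector per node, with no pooling step.

```python
import numpy as np

def normalize_adj(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with added self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_forward(adj, x, w0, w1):
    """Two-layer GCN returning per-node logits (no pooling)."""
    a_norm = normalize_adj(adj)
    h = np.maximum(a_norm @ x @ w0, 0.0)  # ReLU hidden layer
    return a_norm @ h @ w1

# Batch of two graphs (3 nodes + 2 nodes) as one block-diagonal matrix.
A = np.zeros((5, 5))
A[:3, :3] = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A[3:, 3:] = [[0, 1], [1, 0]]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
w0 = rng.normal(size=(3, 4))
w1 = rng.normal(size=(4, 2))

logits = gcn_forward(A, X, w0, w1)        # shape [5, 2], one row per node

# Perturb only the second graph's features and run the model again.
X_perturbed = X.copy()
X_perturbed[3:] += 1.0
logits_perturbed = gcn_forward(A, X_perturbed, w0, w1)
```

The first three rows of `logits` and `logits_perturbed` match, which illustrates that changes in one graph do not leak into the node predictions of another graph in the same batch.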

tkipf / gcn

Graph Classification & Batchwise Training #4