tkipf / gae

Implementation of Graph Auto-Encoders in TensorFlow
MIT License
1.64k stars 351 forks source link

Loss function in optimizer.py #20

Open zzheyu opened 5 years ago

zzheyu commented 5 years ago

Hi @tkipf,

Thank you for sharing the implementation.

In the OptimizerVAE class, when defining the KL divergence, I think there is an (1/num_nodes)^2 term extra. One num_nodes comes in (0.5 / num_nodes), and the other is introduced by tf.reduce_mean.

This contradicts the results in Auto-Encoding Variational Bayes by Kingma and Welling (Appendix B, Solution of -KL, Gaussian case).

Could you expound on this a bit more?

Thanks!

tkipf commented 5 years ago

This is because the cross-entropy loss applies to N^2 terms, i.e. all potential edges, while the KL term only applies to N terms, i.e. all nodes. This normalization makes sure they have a comparable scale.

zzheyu commented 5 years ago

Thank you.

But why would you divide the KL term by N^2 when it is the sum of N terms? Would it make more sense to divide it by N (so that we have the average of the sum, like in the cross-entropy term)?

tkipf commented 5 years ago

I think you are right, this makes the KL term indeed quite small in comparison. The model can still be seen as a beta-VAE with a very small beta parameter. I’ll have to check what scale these terms have in practice when runnning the model to see if there’s an issue.

Overall, a Gaussian prior is not a good choice in any case in combination with a dot-product decoder, as mentioned in our follow-up paper: https://nicola-decao.github.io/s-vae.html

I would recommend to run the GAE model in the non-probabilistic variant or use a hyperspherical posterior/prior.

zzheyu commented 5 years ago

Thanks for the suggestions:)

YH-learning commented 5 years ago

Hi @tkipf, Thanks for your shares. When running the code with the big beta parameter.(i.e., _kl_loss + entloss), the performance is not as good as that using _kl_loss / num_nodes + entloss, I am very confused about this.

Could you explain this more or Have you solved this issue?

Thanks again for your sharing.

tkipf commented 5 years ago

A large beta acts as a regularizer / soft constraint on the latent code, so this will typically harm link prediction performance if the parameter is too large. You should even get reasonable results with beta=0 (i.e. no KL term at all), as the model is typically sufficiently regularized via negative sampling (i.e. we choose random negative links).

On Tue, Dec 25, 2018 at 10:24 AM YH-learning notifications@github.com wrote:

Hi @tkipf https://github.com/tkipf, Thanks for your shares. When running the code with the big beta parameter.(i.e., kl_loss + ent_loss), the performance is not as good as that using kl_loss / num_nodes + ent_loss, I am very confused about this.

Could you explain this more or Have you solved this issue?

Thanks again for your sharing.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-449830369, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYNIkpy3J3naeKXeBYdKfoEf1UBhAks5u8e7HgaJpZM4ZLuk4 .

zzheyu commented 5 years ago

Hi @tkipf,

Have you tried to output the reconstructed adjacency matrix based on the training data? I have found the reconstruction deviated a lot from the original matrix (for the cora graph and the citeseer graph), the number of edges in the reconstructed matrix is 300 times more than the original adj matrix (cora graph).

Of course this could be a mistake, please do correct me if this is not the case.

Thanks

tkipf commented 5 years ago

Can you paste the code you used to reconstruct the adjacency matrix?

On Fri, Jan 18, 2019 at 6:14 PM Zheyu Zhang notifications@github.com wrote:

Hi @tkipf https://github.com/tkipf,

Have you tried to output the reconstructed adjacency matrix based on the training data? I have found the reconstruction deviated a lot from the original matrix (for the cora graph and the citeseer graph), the number of edges in the reconstructed matrix is 300 times more than the original adj matrix (cora graph).

Of course this could be a mistake, please do correct me if this is not the case.

Thanks

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-455620619, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYJqyhDjcRMGGqetb_57EBTYxd64Sks5vEgD-gaJpZM4ZLuk4 .

dawnranger commented 5 years ago

Can you paste the code you used to reconstruct the adjacency matrix?

I'm also confused about this. The results of the reconstructed matrix look like this:

Epoch: 0010 TP=0013246 FN=0000018 FP=4508390 TN=2811610 Precision=0.0029 Recall=0.9986 Epoch: 0020 TP=0013238 FN=0000026 FP=3539226 TN=3780774 Precision=0.0037 Recall=0.9980 Epoch: 0030 TP=0013248 FN=0000016 FP=3282812 TN=4037188 Precision=0.0040 Recall=0.9988 Epoch: 0040 TP=0013254 FN=0000010 FP=3176094 TN=4143906 Precision=0.0042 Recall=0.9992 Epoch: 0050 TP=0013256 FN=0000008 FP=3168532 TN=4151468 Precision=0.0042 Recall=0.9994 Epoch: 0060 TP=0013258 FN=0000006 FP=3138802 TN=4181198 Precision=0.0042 Recall=0.9995 Epoch: 0070 TP=0013264 FN=0000000 FP=3110030 TN=4209970 Precision=0.0042 Recall=1.0000 Epoch: 0080 TP=0013264 FN=0000000 FP=3082102 TN=4237898 Precision=0.0043 Recall=1.0000 Epoch: 0090 TP=0013264 FN=0000000 FP=3063600 TN=4256400 Precision=0.0043 Recall=1.0000 Epoch: 0100 TP=0013264 FN=0000000 FP=3061570 TN=4258430 Precision=0.0043 Recall=1.0000 Epoch: 0110 TP=0013264 FN=0000000 FP=3065990 TN=4254010 Precision=0.0043 Recall=1.0000 Epoch: 0120 TP=0013264 FN=0000000 FP=3069514 TN=4250486 Precision=0.0043 Recall=1.0000 Epoch: 0130 TP=0013264 FN=0000000 FP=3075558 TN=4244442 Precision=0.0043 Recall=1.0000 Epoch: 0140 TP=0013264 FN=0000000 FP=3084226 TN=4235774 Precision=0.0043 Recall=1.0000 Epoch: 0150 TP=0013264 FN=0000000 FP=3092052 TN=4227948 Precision=0.0043 Recall=1.0000 Epoch: 0160 TP=0013264 FN=0000000 FP=3097308 TN=4222692 Precision=0.0043 Recall=1.0000 Epoch: 0170 TP=0013264 FN=0000000 FP=3100544 TN=4219456 Precision=0.0043 Recall=1.0000 Epoch: 0180 TP=0013264 FN=0000000 FP=3101718 TN=4218282 Precision=0.0043 Recall=1.0000 Epoch: 0190 TP=0013264 FN=0000000 FP=3103924 TN=4216076 Precision=0.0043 Recall=1.0000 Epoch: 0200 TP=0013264 FN=0000000 FP=3106388 TN=4213612 Precision=0.0043 Recall=1.0000

The codes are:

    preds = tf.cast(tf.greater_equal(tf.sigmoid(adj_preds), 0.5), tf.int32)
    labels = tf.cast(adj_out, tf.int32)
    self.accuracy = tf.reduce_mean(tf.cast(tf.equal(preds, labels),tf.float32))
    self.TP = tf.count_nonzero(preds * labels)
    self.FP = tf.count_nonzero(preds * (labels - 1))
    self.FN = tf.count_nonzero((preds - 1) * labels)
    self.TN = tf.count_nonzero((preds - 1) * (labels - 1))
    self.precision = self.TP / (self.TP + self.FP) 
    self.recall = self.TP / (self.TP + self.FN) 

It seems that the inner product module tend to reconstruct far more edges than expected. The optimization procedure is reducing FP and increasing TN, but nothing to do with TP.

tkipf commented 5 years ago

Maybe your training set is unbalanced? It looks like this is not necessarily a problem with this code release...

On Sun, Jan 20, 2019 at 3:26 AM dawnranger notifications@github.com wrote:

Can you paste the code you used to reconstruct the adjacency matrix?

I'm also confused about this. The results of the reconstructed matrix looks like this:

Epoch: 0010 TP=0013246 FN=0000018 FP=4508390 TN=2811610 Precision=0.0029 Recall=0.9986 Epoch: 0020 TP=0013238 FN=0000026 FP=3539226 TN=3780774 Precision=0.0037 Recall=0.9980 Epoch: 0030 TP=0013248 FN=0000016 FP=3282812 TN=4037188 Precision=0.0040 Recall=0.9988 Epoch: 0040 TP=0013254 FN=0000010 FP=3176094 TN=4143906 Precision=0.0042 Recall=0.9992 Epoch: 0050 TP=0013256 FN=0000008 FP=3168532 TN=4151468 Precision=0.0042 Recall=0.9994 Epoch: 0060 TP=0013258 FN=0000006 FP=3138802 TN=4181198 Precision=0.0042 Recall=0.9995 Epoch: 0070 TP=0013264 FN=0000000 FP=3110030 TN=4209970 Precision=0.0042 Recall=1.0000 Epoch: 0080 TP=0013264 FN=0000000 FP=3082102 TN=4237898 Precision=0.0043 Recall=1.0000 Epoch: 0090 TP=0013264 FN=0000000 FP=3063600 TN=4256400 Precision=0.0043 Recall=1.0000 Epoch: 0100 TP=0013264 FN=0000000 FP=3061570 TN=4258430 Precision=0.0043 Recall=1.0000 Epoch: 0110 TP=0013264 FN=0000000 FP=3065990 TN=4254010 Precision=0.0043 Recall=1.0000 Epoch: 0120 TP=0013264 FN=0000000 FP=3069514 TN=4250486 Precision=0.0043 Recall=1.0000 Epoch: 0130 TP=0013264 FN=0000000 FP=3075558 TN=4244442 Precision=0.0043 Recall=1.0000 Epoch: 0140 TP=0013264 FN=0000000 FP=3084226 TN=4235774 Precision=0.0043 Recall=1.0000 Epoch: 0150 TP=0013264 FN=0000000 FP=3092052 TN=4227948 Precision=0.0043 Recall=1.0000 Epoch: 0160 TP=0013264 FN=0000000 FP=3097308 TN=4222692 Precision=0.0043 Recall=1.0000 Epoch: 0170 TP=0013264 FN=0000000 FP=3100544 TN=4219456 Precision=0.0043 Recall=1.0000 Epoch: 0180 TP=0013264 FN=0000000 FP=3101718 TN=4218282 Precision=0.0043 Recall=1.0000 Epoch: 0190 TP=0013264 FN=0000000 FP=3103924 TN=4216076 Precision=0.0043 Recall=1.0000 Epoch: 0200 TP=0013264 FN=0000000 FP=3106388 TN=4213612 Precision=0.0043 Recall=1.0000

The codes are:

preds = tf.cast(tf.greater_equal(tf.sigmoid(adj_preds), 0.5), tf.int32)
labels = tf.cast(adj_out, tf.int32)
self.accuracy = tf.reduce_mean(tf.cast(tf.equal(preds, labels),tf.float32))
self.TP = tf.count_nonzero(preds * labels)
self.FP = tf.count_nonzero(preds * (labels - 1))
self.FN = tf.count_nonzero((preds - 1) * labels)
self.TN = tf.count_nonzero((preds - 1) * (labels - 1))
self.precision = self.TP / (self.TP + self.FP)
self.recall = self.TP / (self.TP + self.FN)

It seems that the inner product module rend to reconstruct far more edges than expected.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-455831943, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYCmMRMn-xF1_AT5nu09kJkpOv6nBks5vE9O6gaJpZM4ZLuk4 .

dawnranger commented 5 years ago

In the case of adjacency matrix reconstruction, I think the training set is always unbalanced. For example, in cora dataset, the size of adjacency matrix is 2708*2708=7333264, but there are only 5429 edges exists. The positive-negative rate is 5429*2:(2708*2708-5429*2)=1:674.37, which is the pos_weight in your code, that's why you are using tf.weighted_cross_entropy_with_logits rather than tf.sigmoid_cross_entropy_with_logits.

Maybe your training set is unbalanced? It looks like this is not necessarily a problem with this code release...

tkipf commented 5 years ago

I see, I think I misunderstood the question - I assumed that this was related to some custom adaptation of the model or the use of some other dataset.

Note that 0.5 is not necessarily the best threshold for edge classification in this case. If you want to find the prediction with the best precision/recall trade-off, you can plot a precision/recall curve and find the optimal threshold value for your particular case. This should hopefully give more reasonable results.

On Mon, Jan 21, 2019 at 1:56 AM dawnranger notifications@github.com wrote:

In the case of adjacency matrix reconstruction, I think the training set is always unbalanced. For example, in cora dataset, the size of adjacency matrix is 27082708=7333264, but there are only 5429 edges exists. The positive-negative rate is 54292:(27082708-54292)=1:674.37, which is the pos_weight https://github.com/tkipf/gae/blob/master/gae/train.py#L77 in your code, that's why you are using tf.weighted_cross_entropy_with_logits rather than tf.sigmoid_cross_entropy_with_logits.

Maybe your training set is unbalanced? It looks like this is not necessarily a problem with this code release...

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-455919345, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYM56khcpzxGvFuxuwLnQwb0R_30Pks5vFRAzgaJpZM4ZLuk4 .

zzheyu commented 5 years ago

Hi @tkipf,

Thank you for your suggestion, I have tried on various thresholds. But it seems the reconstructed adjacency matrix is still very noisy even when the threshold is above 0.9.

However, I think the problem is not with the threshold. The performance on validation and test sets in the paper are very good due to these sets being very balanced (half edges and half non-edges). To justify this, I have increased the number of non-edges in the validation and the test sets (5 times, 30 times and 100 times of the number of edges), to simulate the sparsity of the graph. And the average precision scores dropped significantly as the number of non-edges increased. Also I have evaluated F1 score alongside on val/test sets, and they are usually quite low.

Below are the code for the reconstruction. They are identical to those when you have used to evaluate validation and test performance.

feed_dict.update({placeholders['dropout']: 0})
pred = sess.run(model.reconstructions, feed_dict=feed_dict)
A_recon_prob_vec = 1 / (1 + np.exp(-pred))
A_recon_prob = A_recon_prob_vec.reshape(num_nodes,-1)
A_recon_prob = A_recon_prob - np.diag(np.diag(A_recon_prob))

And here is the modification to sample more non-edges in val and test sets.

# Original
val_edges_false = []
while len(val_edges_false) < len(val_edges):
# Modified
val_edges_false = []
while len(val_edges_false) < non_edges * len(val_edges):
# non_edges is an integer indicating the multiple

Thanks

tkipf commented 5 years ago

Thanks, this is indeed a very insightful analysis. It would be interesting to see what model/loss adaptions are necessary for this model to generalize in more realistic scenarios.

We have evaluated variants of GAEs for knowledge base completion ( https://arxiv.org/abs/1703.06103) and recommender systems ( https://arxiv.org/abs/1706.02263), where the model could perform very well on more standard benchmarks. So I think it is not necessarily an issue with the model per se. I think this issue could probably be fixed by training with negative sampling (instead of using a weighted cross entropy objective) -- as also mentioned in the GAE workshop paper.

On Wed, Jan 23, 2019 at 7:11 PM Zheyu Zhang notifications@github.com wrote:

Hi @tkipf https://github.com/tkipf,

Thank you for your suggestion, I have tried on various thresholds. But it seems the reconstructed adjacency matrix is still very noisy even when the threshold is above 0.9.

However, I think the problem is not with the threshold. The performance on validation and test sets in the paper are very good due to these sets being very balanced (half edges and half non-edges). To justify this, I have increased the number of non-edges in the validation and the test sets (5 times, 30 times and 100 times of the number of edges), to simulate the sparsity of the graph. And the average precision scores dropped significantly as the number of non-edges increased. Also I have evaluated F1 score alongside on val/test sets, and they are usually quite low.

Below are the code for the reconstruction. They are identical to those when you have used to evaluate validation and test performance.

feed_dict.update({placeholders['dropout']: 0}) pred = sess.run(model.reconstructions, feed_dict=feed_dict) A_recon_prob_vec = 1 / (1 + np.exp(-pred)) A_recon_prob = A_recon_prob_vec.reshape(num_nodes,-1) A_recon_prob = A_recon_prob - np.diag(np.diag(A_recon_prob))

And here is the modification to sample more non-edges in val and test sets.

Original

val_edges_false = []while len(val_edges_false) < len(val_edges):

Modified

val_edges_false = []while len(val_edges_false) < non_edges * len(val_edges):# non_edges is an integer indicating the multiple

Thanks

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-456907910, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYC5NUv1s0ij2fWN7GzQvw8ZjZj8Iks5vGKXdgaJpZM4ZLuk4 .

zzheyu commented 5 years ago

Thanks for clarifying and your further suggestions.

jlevy44 commented 5 years ago

I've been exploring a few real world social networks using GAE/VGAE and am running into this very problem. Is there a consensus on a solution? Number of positive edges are far less than the negative edges. I've just been adjusting my threshold, which has helped some, but ideally, I at least should be able to overfit on the adjacency itself and recover the original adjacency. This is not so.

I have a 33 node network and a 71 node network.

tkipf commented 5 years ago

You can try exp(-d(x,y)) as a scoring function for links where d(x,y) is the Euclidean distance between two node embedding vectors. This might be less susceptible to producing more or less densely connected graphs than the scoring function sigmoid(x^T y) (based on an inner product).

On Mon 1. Apr 2019 at 17:49 Joshua Levy notifications@github.com wrote:

I've been exploring a few real world social networks using GAE/VGAE and am running into this very problem. Is there a consensus on a solution? Number of positive edges are far less than the negative edges. I've just been adjusting my threshold, which has helped some, but ideally, I at least should be able to overfit on the adjacency itself and recover the original adjacency. This is not so.

I have a 33 node network and a 71 node network.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-478657131, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYNwvshGpTtWevZa93kBRNX9EuOPvks5vcjiFgaJpZM4ZLuk4 .

jlevy44 commented 5 years ago

Is there literature to support the use of this?

tkipf commented 5 years ago

I recommend having a look at "A Tutorial on Energy-Based Learning" by LeCun et al.. Using a scoring function like exp(-d(x,y)) essentially boils down to an energy-based model with d(x,y) being the energy associated with a particular edge. Energy-based losses have been explored in the context of link prediction before, e.g., in https://arxiv.org/abs/1707.03815 . Scoring functions in terms of inner product, on the other hand, come up in the context of graphons (https://en.wikipedia.org/wiki/Graphon) which are known to produce "dense blobs" when queried as a generative model.

On Mon, Apr 1, 2019 at 9:43 PM Joshua Levy notifications@github.com wrote:

Is there literature to support the use of this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-478739882, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYOMxVBEP3JHxjpubuFqoOQTjDCW0ks5vcm9XgaJpZM4ZLuk4 .

XuanHeIIIS commented 5 years ago

Hi @tkipf, Thanks for your nice work! I noticed that most of the GCNs were trained based on optimizing the structure (X) reconstruction error. So did you tried to train GCN by minimizing the feature (A) reconstruction error ? If you tried, can you share more details about it? Thanks!

tkipf commented 5 years ago

Yes, we've played around with model variants where we also reconstruct the node features (X) or include label information in a semi-supervised auto-encoder setting. Both approaches usually help improve the overall performance.

On Tue, Apr 2, 2019 at 10:22 AM Andy He notifications@github.com wrote:

Hi @tkipf https://github.com/tkipf, Thanks for your nice work! I noticed that most of the GCNs were trained based on optimizing the structure (X) reconstruction error. So did you tried to train GCN by minimizing the feature (A) reconstruction error ? If you tried, can you share more details about it? Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-478915428, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYMAIRcFHx0Ut6V-ec06sgg4pE5dbks5vcyFmgaJpZM4ZLuk4 .

XuanHeIIIS commented 5 years ago

Can you share some examples which reconstruct the node features? I am a little confused about the reconstruction process. Thanks!

tkipf commented 5 years ago

You can use a node-level MLP to reconstruct node features from the final decoder GCN outputs. If node features are binary, then simply using a binary cross entropy loss works well. Otherwise for continuous features, MSE (mean-squared error) works well.

On Tue, Apr 2, 2019 at 10:43 AM Andy He notifications@github.com wrote:

Can you share some examples which reconstruct the node features? I am a little confused about the reconstruction process. Thanks!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/tkipf/gae/issues/20#issuecomment-478922904, or mute the thread https://github.com/notifications/unsubscribe-auth/AHAcYNDV6H_vQ_6Pc59HFONW8I8l_PdWks5vcyZRgaJpZM4ZLuk4 .

Yumlembam commented 3 years ago

exp

what should I used for the activation z*z^T if I use exp(-d(x,y)) as the lost @tkipf .