tkipf / gae

Implementation of Graph Auto-Encoders in TensorFlow
MIT License
1.65k stars 351 forks

Edges in val_edges and test_edges are trained with label '0'? #48

Open YH-UtMSB opened 5 years ago

YH-UtMSB commented 5 years ago

Hi @tkipf ,

It seems model.reconstructions includes all the positive and negative edges, no matter in training, validation or test set, whereas the labels used during training only have "1" entries on train_edges. It confuses me since it looks like "the model is trained to score "1" for train_edges and "0" for val_edges and test_edges," how come the validation and test accuracy are that good? Would it be better to mask out val_edges and test_edges before feeding the model.reconstructions to the optimizer?

Looking forward to your reply. Thanks.

tkipf commented 5 years ago

This is a very good question -- this approach works as the model is typically not fitting the training data perfectly and hence has a certain inductive bias that allows it to generalize to the test/validation set edges. Furthermore, the number of negative edges is usually much larger than the number of positive edges, which further reduces the effect of initially providing the "wrong" negative edge labels.
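
To get a feel for the scale of this effect, here is a rough back-of-the-envelope sketch. The node/edge counts are approximate Cora statistics and the 5%/10% split mirrors the repo's preprocessing; none of these numbers come from the thread itself:

```python
# Rough illustration (assumed Cora-like statistics, not exact repo numbers):
# the dense adjacency has n*n entries, almost all of which are true negatives,
# so the handful of mislabeled val/test positives barely affects the loss.
n_nodes = 2708                      # nodes in Cora (approximate)
n_edges = 5429                      # undirected positive edges (approximate)
val_frac, test_frac = 0.05, 0.10    # split fractions used in the repo's preprocessing

total_entries = n_nodes * n_nodes
# Each held-out positive edge appears twice in the symmetric adjacency
# and is (wrongly) labeled "0" during training.
mislabeled = 2 * n_edges * (val_frac + test_frac)
print(f"mislabeled fraction of loss terms: {mislabeled / total_entries:.6f}")
```

Under these assumptions, well under 0.1% of the reconstruction targets carry a "wrong" label, which is consistent with the effect being small in practice.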

Masking out val- and test-edges is problematic, as this would mean one has to know in advance which edges are in the validation and test set, which of course does not work like this on real-world problems.


YH-UtMSB commented 5 years ago

@tkipf Thank you so much for your reply! I apologize for a typo in my original question about "masking out". What I wonder is: in the training stage, would it be better to mask out the entire validation & test set, so that they do not contribute to the log-likelihood in the loss function? But it won't surprise me if the effect is imperceptible, given what you pointed out about the model's inductive bias and its tolerance for some "wrong" negative edge labels.

tkipf commented 5 years ago

Yes, it will probably work better, but most likely not by much (depending on how large your dataset is). At the same time, this will "leak" validation and test set information into the training set, unless you mask out both positive and "negative" edges.
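
A minimal sketch of what masking out both the positive and "negative" held-out edges could look like. This assumes NumPy and hypothetical names (`build_loss_mask`, `masked_log_likelihood`, `held_out_pairs`) that are purely illustrative, not the repo's actual implementation:

```python
import numpy as np

def build_loss_mask(n_nodes, held_out_pairs):
    """Return an (n, n) float mask: 0 for held-out entries, 1 elsewhere.

    held_out_pairs should contain BOTH the positive val/test edges and their
    sampled negative counterparts, so no held-out information shapes the loss.
    """
    mask = np.ones((n_nodes, n_nodes), dtype=np.float32)
    for i, j in held_out_pairs:
        mask[i, j] = 0.0
        mask[j, i] = 0.0  # the graph is undirected, so mask symmetrically
    return mask

def masked_log_likelihood(logits, labels, mask):
    """Average per-entry log-likelihood, ignoring masked (held-out) entries."""
    # log sigmoid(x) = -log1p(exp(-x)); log(1 - sigmoid(x)) = -log1p(exp(x))
    ll = labels * -np.log1p(np.exp(-logits)) + (1 - labels) * -np.log1p(np.exp(logits))
    return (ll * mask).sum() / mask.sum()
```

Normalizing by `mask.sum()` rather than `n_nodes**2` keeps the loss scale comparable to the unmasked version.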


YH-UtMSB commented 5 years ago

Yes, I meant to mask out both edges and non-edges in test & validation set. Thanks for all the detailed replies, now I have a much better understanding of this model.

cfpark00 commented 3 years ago

Hello, I have a perhaps related question. It seems that the model's GCN layer sees the full graph, but the model is trained so that the decoder reconstructs the remaining graph (without test/val edges). This confuses me a little bit:

  1. Is it "fair" to give the model the full adjacency?
  2. Is it fine to train the model to reconstruct the remaining graph? (This is answered by this thread)

I guess both of these issues have minor impact on what actually happens?