pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

RandomLinkSplit can cause a data leak when sampling negative training edges #6946

Open · Barcavin opened this issue 1 year ago

Barcavin commented 1 year ago

🐛 Describe the bug

The train/test split transform for link prediction (RandomLinkSplit) has a bug in how it samples negative training edges. The current implementation samples negative training edges based on the edge_index of the full graph, i.e. before the train/test split. In this setting, one could sample every possible negative edge, and all remaining edges would then be guaranteed true positives in train/valid/test. This causes a data leak: a model can simply memorize all the true negative edges.

https://github.com/pyg-team/pytorch_geometric/blob/d4b12973da1c53048d9d378361464c88a300d0a6/torch_geometric/transforms/random_link_split.py#L212-L214C38
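
To see why this leaks, consider the extreme case sketched below (hypothetical code, not from PyG; it only assumes `train_data` is a split returned by RandomLinkSplit, which stores sampled negatives in `edge_label_index` at positions where `edge_label == 0`):

```python
# Hypothetical sketch of the leak: if the negatives sampled against
# the full graph cover every possible negative edge, any pair missing
# from that set must be a true positive in train/valid/test.
neg_mask = train_data.edge_label == 0
neg_set = {
    tuple(pair)
    for pair in train_data.edge_label_index[:, neg_mask].t().tolist()
}

def leaky_predict(u: int, v: int) -> float:
    # Perfect scores without learning anything about the graph:
    # membership in the memorized negative set determines the label.
    return 0.0 if (u, v) in neg_set else 1.0
```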

A solution would be to perform the negative sampling for training based only on the training split's edge_index.
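
A minimal sketch of that fix, assuming `train_data` is the training split and using `torch_geometric.utils.negative_sampling`:

```python
from torch_geometric.utils import negative_sampling

# Sample training negatives against the training split only, so that
# held-out val/test positives are not excluded from the candidate set
# (excluding them is exactly what leaks information).
neg_edge_index = negative_sampling(
    edge_index=train_data.edge_index,
    num_nodes=train_data.num_nodes,
    num_neg_samples=train_data.edge_index.size(1),
)
```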

Environment

No response

rusty1s commented 1 year ago

Mh, this is a good point. On the other hand, I think it is important that RandomLinkSplit avoids sampling true positives as negatives - sampling them is something users wouldn't expect. However, I agree that it is definitely not recommended to sample from the full set of negative edges, as this indeed leaks information.

I am not entirely sure how to fix this. Do you think it makes sense to add an option to turn this behavior on and off? Maybe we should drop negative sampling support for training altogether, as it is better done within each training iteration anyway.
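
For reference, per-iteration sampling could look roughly like this (a sketch only; `model`, `optimizer`, and `num_epochs` are placeholders, and the `encode`/`decode` interface is an assumption, not a PyG API):

```python
import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

for epoch in range(num_epochs):
    optimizer.zero_grad()
    z = model.encode(train_data.x, train_data.edge_index)

    # Draw fresh negatives each epoch, against the training graph only.
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index,
        num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_index.size(1),
    )

    # Concatenate positives and negatives with matching binary labels.
    edge_label_index = torch.cat(
        [train_data.edge_index, neg_edge_index], dim=-1)
    edge_label = torch.cat([
        torch.ones(train_data.edge_index.size(1)),
        torch.zeros(neg_edge_index.size(1)),
    ])

    out = model.decode(z, edge_label_index)
    loss = F.binary_cross_entropy_with_logits(out, edge_label)
    loss.backward()
    optimizer.step()
```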

Barcavin commented 1 year ago

Actually, even the negatives in the validation set should be sampled based on the training edges rather than the full graph. Since the model can see the validation set during model selection, it should be treated the same as training data. Validation performance being much better than test performance is a common situation I encounter. So either way, there should be two negative sampling processes: one for the negatives in training/validation, and one for those in testing.

But I agree that we could just drop negative sampling support for training, because I don't see any codebase using this feature (maybe it is relatively new?). That said, considering we may need an additional negative sampling process anyway, there is no harm in keeping it either.

rusty1s commented 1 year ago

Not totally sure I understand your concerns regarding the validation set. The validation set should only include true negatives and true positives - otherwise it is a weird validation set :) IMO, I can't think of a good reason why this would introduce a distribution shift between validation and test. It is of course possible that we sample harder negatives during testing, which may explain a drop in performance, but this can also happen the other way around.

Barcavin commented 1 year ago

Yeah, it makes sense to me.

Anyhow, I think it might be more appropriate to turn off negative sampling for training, given that the current implementation can leak information and most folks are not actively using it.

rusty1s commented 1 year ago

I think that's a valid concern. To fix this, we should just change the default value of add_negative_train_samples from True to False. Would that work for you?
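
Usage would then look like this (a sketch; the split fractions are arbitrary, and `data` is assumed to be a `torch_geometric.data.Data` object):

```python
from torch_geometric.transforms import RandomLinkSplit

# With add_negative_train_samples=False, only the val/test splits get
# pre-sampled negatives; training negatives are instead sampled on the
# fly during each training iteration.
transform = RandomLinkSplit(
    num_val=0.1,
    num_test=0.2,
    is_undirected=True,
    add_negative_train_samples=False,
)
train_data, val_data, test_data = transform(data)
```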

Barcavin commented 1 year ago

Sounds good to me!
