pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Split Error in RandomLinkSplit #3668

Open lmy86263 opened 2 years ago

lmy86263 commented 2 years ago

🐛 Bug

When I use RandomLinkSplit to split the MovieLens dataset, I found that the split data is wrong.

To Reproduce

The link prediction task is as follows:

train_data, val_data, test_data = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,
    edge_types=[('user', 'rates', 'movie')],
    rev_edge_types=[('movie', 'rev_rates', 'user')],
)(data)

I get the following result:

train: 80670 (this is right), val: 80670 (wrong), test: 90753 (wrong)

Expected behavior

The number of edges ('user', 'rates', 'movie') in this dataset is 100836. According to the ratio (0.8, 0.1, 0.1), we should get the split dataset as follows:

train: 80670, val: 10083, test: 10083

Environment

Additional context

Reviewing the source code, I found that the error may come from line 176 in RandomLinkSplit, which seems to be called with the wrong parameters.

rusty1s commented 2 years ago

I think this is totally correct. It seems like you are looking at the shapes of edge_index, while you may want to look at the shapes of edge_label and edge_label_index (which correctly model an 80/10/10 split ratio). Here, edge_index is solely used for message passing, i.e.:

for training, we exchange messages on all training edges
for validation, we exchange messages on all training edges
for testing, we exchange messages on all training and validation edges
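
For example, a quick check of this on the split above (a sketch; it assumes the train_data/val_data/test_data produced by the snippet in the issue):

for name, split in [('train', train_data), ('val', val_data), ('test', test_data)]:
    # supervision edges follow the 80/10/10 ratio
    print(name, split['user', 'rates', 'movie'].edge_label_index.size(1))
# expected (up to rounding): train 80670, val 10083, test 10083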

Let me know if this resolves your concerns :)

lmy86263 commented 2 years ago

It is not completely solved yet. One question: when a link occurs in training, validation, and testing at the same time, is there information leakage between the different datasets, especially for link prediction?

rusty1s commented 2 years ago

You mean that the link appears during training both for message passing and as ground-truth? I think it depends. For example, in the case that you want to classify edges into ratings, it's totally fine to use the knowledge of the existence of edges during message passing (it would be different if you used the knowledge of the ratings used for supervision).

To completely eliminate any data leakage, have a look at the disjoint_train_ratio of RandomLinkSplit.
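
For illustration, a minimal sketch of that option (shown on a homogeneous data object; the heterogeneous call works analogously):

import torch_geometric.transforms as T

transform = T.RandomLinkSplit(num_val=0.1, num_test=0.1,
                              disjoint_train_ratio=0.3)
train_data, val_data, test_data = transform(data)
# 30% of the training edges are now reserved purely for supervision
# (edge_label_index) and removed from the message-passing graph
# (edge_index), so no training edge plays both roles.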

lmy86263 commented 2 years ago

Thanks, this provides a reasonable interpretation of the split for link prediction.

shahinghasemi commented 2 years ago

Here, edge_index is solely used for message passing, i.e.: for training, we exchange messages on all training edges; for validation, we exchange messages on all training edges; for testing, we exchange messages on all training and validation edges

@rusty1s, would you please elaborate on what you mean by the message passing phase for link prediction?

rusty1s commented 2 years ago

For link prediction with GNNs, we first perform message passing on the original graph and use the resulting node embeddings to infer the probability of new links. As such, we have links to perform message passing on (edge_index), and links which we want to train/evaluate against (edge_label_index). RandomLinkSplit takes care of separating these two correctly.
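
For illustration, a minimal sketch of this two-step setup (a hypothetical homogeneous model, not code from this thread; it assumes a train_data object produced by RandomLinkSplit):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class LinkPredictor(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)

    def encode(self, x, edge_index):
        # message passing over the message edges (edge_index)
        return self.conv2(self.conv1(x, edge_index).relu(), edge_index)

    def decode(self, z, edge_label_index):
        # score each supervision edge from its endpoint embeddings
        src, dst = edge_label_index
        return (z[src] * z[dst]).sum(dim=-1)

model = LinkPredictor(train_data.num_features, 64)
z = model.encode(train_data.x, train_data.edge_index)
out = model.decode(z, train_data.edge_label_index)
loss = F.binary_cross_entropy_with_logits(out, train_data.edge_label)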

shahinghasemi commented 2 years ago

@rusty1s According to this video, for the link prediction task we have four types of edges: training supervision edges, training message edges, validation edges, and testing edges. I'm a little bit confused about training supervision edges and training message edges. Here are my questions (context: a heterogeneous network):

  1. What's the difference between training supervision edges and training message edges? I know they're both used in the training phase, but I don't know the difference.
  2. Can training supervision edges and training message edges have common edges, or should they be disjoint sets?

A simple example would help a lot! thanks in advance.

rusty1s commented 2 years ago
  1. "Training message edges" are the edges that are used in the GNN part of your model: The edges that you use to exchange neighborhood information and to enhance your node representations. "Training supervision edges" are then used to train your final link predictor: Given a training supervision edge, you take the source and destination node representations obtained from a GNN and use them as input to predict the probability of a link.
  2. This depends on the model and validation performance. In GAE (https://arxiv.org/abs/1611.07308), training supervision edges and training message edges denote the same set of edges. In SEAL (https://arxiv.org/pdf/1802.09691.pdf), training supervision edges and training message edges are disjoint.

    In general, I think using the same set of edges for message passing and supervision may lead to some data leakage in your training phase, but this depends on the power/expressiveness of your model. For example, GAE uses a GCN-based encoder and a dot-product-based decoder. Both encoder and decoder have limited power, so the data leakage capabilities of the model are limited as well.
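
A toy illustration of the two roles (a hypothetical 4-node graph, not from this thread):

import torch

# message edges: used inside the GNN to aggregate neighbor information
message_edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])
# supervision edge: the link predictor is trained to score this pair
supervision_edge_index = torch.tensor([[0], [3]])
# GAE-style training uses the same set for both roles; SEAL-style
# training keeps the two sets disjoint.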

shahinghasemi commented 2 years ago

Thank you @rusty1s, I guess I get the idea. My last question: is this correct? The test edges should not be included in either the message edges or the supervision edges; in other words, they're disjoint sets.

rusty1s commented 2 years ago

Yes, this is correct. Validation and test edges always need to be disjoint from the training edges.

CocoGzh commented 2 years ago

For link prediction with GNNs, we first perform message passing on the original graph and use the resulting node embeddings to infer the probability of new links. As such, we have links to perform message passing on (edge_index), and links which we want to train/evaluate against (edge_label_index). RandomLinkSplit takes care of separating these two correctly.

It seems that negative samples are automatically generated in edge_label and edge_label_index of the validation and test sets even with add_negative_train_samples=False. Is this to evaluate the model more fairly?

rusty1s commented 2 years ago

Yes, this is correct. For inference, we typically want to evaluate on the same set of positive and negative edges across epochs.
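
For illustration (a sketch relying on the default neg_sampling_ratio=1.0 and a homogeneous data object):

transform = RandomLinkSplit(num_val=0.1, num_test=0.1,
                            add_negative_train_samples=False)
train_data, val_data, test_data = transform(data)
print(train_data.edge_label.unique())  # tensor([1.]): positives only
print(val_data.edge_label.unique())    # tensor([0., 1.]): fixed negatives included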

ashim-mahara commented 2 years ago

Sorry for hijacking the thread, but does RandomLinkSplit perform splits on edge_attr and the label tensor y too? If yes, how do I access the edge attributes? BTW, my output after splitting is:

split_transform = RandomLinkSplit(num_test = 0.2, num_val = 0.1, is_undirected=False)
train_data, val_data, test_data = split_transform(data)

print(train_data)

Data(x=[19129, 1], edge_index=[2, 1979514], edge_attr=[1979514, 80], y=[1979514], is_directed=True, edge_label=[3959028], edge_label_index=[2, 3959028])

print(val_data)

Data(x=[19129, 1], edge_index=[2, 1979514], edge_attr=[1979514, 80], y=[1979514], is_directed=True, edge_label=[565574], edge_label_index=[2, 565574])

print(test_data)

Data(x=[19129, 1], edge_index=[2, 2262301], edge_attr=[2262301, 80], y=[2262301], is_directed=True, edge_label=[1131150], edge_label_index=[2, 1131150])

I am sorry but I am having a hard time interpreting the output of the RandomLinkSplit function.

rusty1s commented 2 years ago

The split is performed based on edge_index and applied to all attributes that are identified as edge features (in your case, edge_attr and y). It will also create edge_label and edge_label_index attributes, which contain the negatively sampled edges and their labels. I hope this clarifies some of your doubts.
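
A quick sanity check of that layout (a sketch, consistent with the shapes printed above):

# edge-level attributes follow the message edges:
assert train_data.edge_attr.size(0) == train_data.edge_index.size(1)
assert train_data.y.size(0) == train_data.edge_index.size(1)
# supervision edges (incl. negatives) live in edge_label/edge_label_index:
assert train_data.edge_label.size(0) == train_data.edge_label_index.size(1)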

ashim-mahara commented 2 years ago

So how should I utilize edge_label_index? I tried, but edge_label contains values in [0, 1] and edge_label_index has shape [2, num_edges]. I am a bit confused as to how I can leverage those to split edge_attr. I tried setting key='y', which results in a successful split of y with the desired outcome, but not for edge_attr. Do you have a code snippet that can explain the process? Thanks for the prompt reply.

rusty1s commented 2 years ago

Note that edge_attr is already split as well. With key="y", you get the following behavior:
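
A sketch of that behavior, based on the documented semantics of the key argument (not output from this thread):

transform = RandomLinkSplit(num_val=0.1, num_test=0.2, key='y')
train_data, val_data, test_data = transform(data)
# `y` is now treated as the categorical edge label: existing labels are
# shifted up by one so that label 0 can denote sampled negative edges, and
# the split stores the supervision labels/edges as `y`/`y_index` instead of
# `edge_label`/`edge_label_index`.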

ashim-mahara commented 2 years ago

However, y is related to edge_attr, as in y = theta(edge_attr). So for each (source, edge_attr, destination) triplet, I would like to compute a label y. y could also be interpreted as an edge label. I am sorry, but I am very new to GNNs and trying to learn.

rusty1s commented 2 years ago

In that case, you might want to drop the RandomLinkSplit transform (which is more applicable for a link prediction scenario in which links in the graph are actually missing), and perform a standard random splitting on your own:

import torch

# shuffle all edge positions, then take contiguous 80/10/10 chunks
perm = torch.randperm(data.num_edges)
data.train_idx = perm[:int(0.8 * data.num_edges)]
data.val_idx = perm[int(0.8 * data.num_edges):int(0.9 * data.num_edges)]
data.test_idx = perm[int(0.9 * data.num_edges):]
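
And a sketch of how these indices could then be used downstream (attribute names matching the output above):

train_edge_index = data.edge_index[:, data.train_idx]
train_edge_attr = data.edge_attr[data.train_idx]
train_y = data.y[data.train_idx]
# ...and analogously with val_idx/test_idx for evaluation.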

Let me know if that works for you.

ashim-mahara commented 2 years ago

That works, thanks! That snippet needs to be saved somewhere. I could contribute it to the docs, but I don't know how to.

rusty1s commented 2 years ago

Sounds good. We could just add this bit of information as a note to the RandomLinkSplit documentation, see here.

ashim-mahara commented 2 years ago

"The RandomLinkSplit transform is primarily used in a link prediction scenario, where the task is to predict missing links in a graph." at line 22.

I thought it would be better at the top, as a piece of contextual information, rather than at the bottom.

rusty1s commented 2 years ago

Please feel free to contribute this in a PR to credit you. I can fine-tune it afterwards :)

ashim-mahara commented 2 years ago

I feel a bit silly opening a PR for such a small commit. Are there any task boards I can view? I'll see if I can make any other contributions.

rusty1s commented 2 years ago

Small PRs are the best :) Otherwise, we are also looking for some help to fill our "Dataset Cheatsheet".

ashim-mahara commented 2 years ago

Okay. I'll see what I can do :)

SimonCrouzet commented 2 years ago

Sorry for updating the thread, but I just want to be sure that I'm correctly understanding the insights discussed by @rusty1s and @katyansun.

If I'm understanding correctly:

Then the usage of those edge_label attributes depends on us:

rusty1s commented 2 years ago

Let me know if this makes sense.

SimonCrouzet commented 2 years ago

It indeed makes sense, thanks for clarifying!

If that makes sense, I could add some lines to the docs and/or write a short function to split edges when we want to find missing links.

rusty1s commented 2 years ago

Sure, happy to extend the documentation in this regard :)

songsong0425 commented 1 year ago

Sorry to revisit this thread, @rusty1s. I have a simple question about the mismatch in the number of edges across the split datasets. When I try a link prediction task and run RandomLinkSplit from torch_geometric.transforms, it returns a different number of edges than expected, as shown below:

data
# Data(x=[47957, 256], edge_index=[2, 2161412])
# Train : Val : Test = 7 : 1 : 2
# Expected number of edges per split: train (1512988), val (216141), test (432282)

# Case1: using RandomLinkSplit
transform = RandomLinkSplit(num_val=0.1, num_test=0.2, is_undirected=True, split_labels=True)
train_data, val_data, test_data = transform(data)

train_data
# Data(x=[47957, 256], edge_index=[2, 2222032], pos_edge_label=[1111016], pos_edge_label_index=[2, 1111016], neg_edge_label=[1111016], neg_edge_label_index=[2, 1111016])
val_data
# Data(x=[47957, 256], edge_index=[2, 2222032], pos_edge_label=[158716], pos_edge_label_index=[2, 158716], neg_edge_label=[158716], neg_edge_label_index=[2, 158716])
test_data
# Data(x=[47957, 256], edge_index=[2, 2539464], pos_edge_label=[317432], pos_edge_label_index=[2, 317432], neg_edge_label=[317432], neg_edge_label_index=[2, 317432])

Although I read all the comments in this thread, I'm not sure why there are missing edges. Is it due to isolated edges that can't perform message passing? Also, if I split the edges manually, will it cause any problems during model training, validation, and testing?

# Case2: manual splitting
perm = torch.randperm(data.num_edges)
data.train_idx = perm[:int(0.7 * data.num_edges)]
data.val_idx = perm[int(0.7 * data.num_edges):int(0.8 * data.num_edges)]
data.test_idx = perm[int(0.8 * data.num_edges):]

data
# Data(x=[47957, 256], edge_index=[2, 2161412], train_idx=[1512988], val_idx=[216141], test_idx=[432283])

rusty1s commented 1 year ago

This is likely due to the is_undirected option, since it will only return one direction of each edge pair (the upper half) for supervision. Is your graph really undirected?
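
A quick way to check this (a sketch using a utility from torch_geometric.utils):

from torch_geometric.utils import is_undirected
print(is_undirected(data.edge_index))
# If this prints False, pass is_undirected=False to RandomLinkSplit so that
# every edge (not just one direction per pair) is used for supervision.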

LuisaWerner commented 1 year ago

I also have a question in the context of RandomLinkSplit:

In the example's setup, no negative edges are sampled for the training set by RandomLinkSplit (add_negative_train_samples=False). However, I saw in the example script for link prediction here that negative edges are instead sampled in the train method.

from torch_geometric.utils import negative_sampling

neg_edge_index = negative_sampling(
    edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
    num_neg_samples=train_data.edge_label_index.size(1), method='sparse')

Would the behavior be the same if I don't sample negative edges in the train method but instead modify T.RandomLinkSplit(num_val=0.05, num_test=0.1, is_undirected=False, add_negative_train_samples=True)? In other words, does setting add_negative_train_samples = True do the same as adding the negative sampling to the training method?

rusty1s commented 1 year ago

It is not the same. If you sample negative training edges in RandomLinkSplit, these negative samples will be fixed for the whole training procedure. Negative sampling on-the-fly instead guarantees that we always see a different set of negative samples during training, thus providing a better learning signal (in general).
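
A sketch of that on-the-fly pattern (per-epoch resampling, following the linked example script; num_epochs and train_data are assumed from context):

import torch
from torch_geometric.utils import negative_sampling

for epoch in range(num_epochs):
    # draw fresh negatives every epoch
    neg_edge_index = negative_sampling(
        edge_index=train_data.edge_index, num_nodes=train_data.num_nodes,
        num_neg_samples=train_data.edge_label_index.size(1), method='sparse')
    edge_label_index = torch.cat(
        [train_data.edge_label_index, neg_edge_index], dim=-1)
    edge_label = torch.cat(
        [train_data.edge_label,
         train_data.edge_label.new_zeros(neg_edge_index.size(1))], dim=0)
    # ...compute logits for edge_label_index and the loss against edge_label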

LuisaWerner commented 1 year ago

Thanks for clarifying @rusty1s