Undocumented unintuitive behaviour of RandomLinkSplit

fratajcz commented 9 months ago

📚 Describe the documentation issue

Hi!

I think the RandomLinkSplit class is crucial for reproducible results in link prediction use cases and so far can not be easily substitute by any splitting class that I have come across. However, the behaviour on undirected graphs is so unintuitive that it had me question my sanity for a little while. Coming from sklearn's train_test_split class I was expecting that the three objects returned by RandomLinkSplit are disjoint subsets of the initial edge_index. However, to arrive at this result, we have to perform some additional steps:

Example:

import torch
import torch_geometric
from torch_geometric.data import Data
from torch_geometric.transforms import RandomLinkSplit
from torch_geometric.utils import to_undirected
torch_geometric.seed_everything(2)

edge_index = torch.tensor([[0, 1, 1, 2, 2, 3, 3, 4, 4, 5],
                           [1, 0, 2, 1, 3, 2, 4, 3, 5, 4]])

data = Data(edge_index=edge_index, num_nodes=10)
print(data.is_undirected())

transform = RandomLinkSplit(num_val=0.2, num_test=0.2, is_undirected=True, add_negative_train_samples=False)
train_data, val_data, test_data = transform(data)

print("Edges:")
print(train_data.edge_index)
print(val_data.edge_index)
print(test_data.edge_index)

print("\nEdge Label Indices:")
print(train_data.edge_label_index)
print(val_data.edge_label_index)
print(test_data.edge_label_index)

print("\nEdge Label Indices after subsetting:")
print(train_data.edge_label_index[:, train_data.edge_label==1])
print(val_data.edge_label_index[:, val_data.edge_label==1])
print(test_data.edge_label_index[:, test_data.edge_label==1])

print("\nEdge Label Indices after subsetting and adding inverses:")
print(to_undirected(train_data.edge_label_index[:, train_data.edge_label==1]))
print(to_undirected(val_data.edge_label_index[:, val_data.edge_label==1]))
print(to_undirected(test_data.edge_label_index[:, test_data.edge_label==1]))

Output:

True
Edges:
tensor([[3, 4, 1, 4, 5, 2],
        [4, 5, 2, 3, 4, 1]])
tensor([[3, 4, 1, 4, 5, 2],
        [4, 5, 2, 3, 4, 1]])
tensor([[3, 4, 1, 0, 4, 5, 2, 1],
        [4, 5, 2, 1, 3, 4, 1, 0]])

Edge Label Indices:
tensor([[3, 4, 1],
        [4, 5, 2]])
tensor([[0, 0],
        [1, 8]])
tensor([[2, 1],
        [3, 3]])

Edge Label Indices after subsetting:
tensor([[3, 4, 1],
        [4, 5, 2]])
tensor([[0],
        [1]])
tensor([[2],
        [3]])

Edge Label Indices after subsetting and adding inverses:
tensor([[1, 2, 3, 4, 4, 5],
        [2, 1, 4, 3, 5, 4]])
tensor([[0, 1],
        [1, 0]])
tensor([[2, 3],
        [3, 2]])

Expected behaviour: train_data.edge_index, val_data.edge_index and test_data.edge_index are disjoint subsets of data.edge_index, as is shown in the last three print statements.

Observed behaviour: edge_index of the three return data objects are almost identical and I can't understand why that would be desired. Several steps are necessary to achieve the expected behaviour which requires good knowledge of the used class.

Suggest a potential alternative/fix

State in the documentation that the splitting does not happen on the returned edge_index tensors.

rusty1s commented 9 months ago

I agree we can do a better job at documenting RandomLinkSplit. Please note that only the resulting edge_label_index is guaranteed to be disjoint. edge_index (i.e., the edges used to perform message passing) are mostly shared (with the small exception that we are also using validation message passing edges during testing).

fratajcz commented 9 months ago

Ah, I think I'm beginning to see how the output is meant to be used, thanks for clarifying. It makes sense, just not in the way i expected.

pyg-team / pytorch_geometric

Undocumented unintuitive behaviour of RandomLinkSplit #8845

📚 Describe the documentation issue

Suggest a potential alternative/fix