pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Error with RandomLinkSplit for multi-label edge classification #9262

Open sadrahkm opened 5 months ago

sadrahkm commented 5 months ago

🐛 Describe the bug

Recently, I've been working on a multi-label edge classification problem, i.e. an edge can have more than one label. I implemented a simple GNN model to see whether I get good results.

I have 935 label types, encoded with sklearn's MultiLabelBinarizer. I have verified that all label values are 0 or 1.
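
For reference, the encoding step looks roughly like this (labels_per_edge is an illustrative placeholder for my raw per-edge label sets, one set per edge):

from sklearn.preprocessing import MultiLabelBinarizer
import torch

mlb = MultiLabelBinarizer()
# One multi-hot row per edge; in my case this has 935 columns.
edge_label = torch.as_tensor(mlb.fit_transform(labels_per_edge))
assert edge_label.unique().tolist() == [0, 1]  # only 0/1 values
data.edge_label = edge_label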

But after splitting the edges with RandomLinkSplit, I noticed that there are more than two label values in the validation and test sets: the train set contains only 0 and 1, but the validation set contains 0, 1, and 2. This makes things difficult. The screenshot below shows this: the first cell is the original data encoded with MultiLabelBinarizer, and the next three cells are the train/val/test sets produced by the RandomLinkSplit call in the code block below.

[screenshot: unique label values in the original data and in the train/val/test splits]

For example, I want to compute the AUC score during testing. I have attached the code and the error I get. I don't understand why the edge split transform returns more than two label values; I think they should only be 0 or 1. I would appreciate your help with this.

transform = T.RandomLinkSplit(
    num_val=0.1,                        # 10% of edges for validation
    num_test=0.1,                       # 10% of edges for testing
    disjoint_train_ratio=0.0,           # keep all training edges for supervision
    add_negative_train_samples=False,   # no negative sampling for the train set
)
train_data, val_data, test_data = transform(data)

from torchmetrics.classification import MultilabelAUROC

@torch.no_grad()
def test(data):
    model.eval()
    z = model.encode(data.x, data.edge_index)      # node embeddings
    out = model.decode(z, data.edge_label_index)   # per-edge scores, shape [num_edges, 935]

    ml_auroc = MultilabelAUROC(num_labels=935, average="macro", thresholds=None)
    auc = ml_auroc(out.cpu(), data.edge_label.cpu().long())  # targets must be integer 0/1

    return auc

for epoch in range(1, 100):
    loss = train()
    val_auc = test(val_data)
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}, Val AUC: {val_auc:.4f}')
RuntimeError: Detected the following values in `target`: tensor([0, 1, 2]) but expected only the following values [0, 1].

Versions

Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.36

Python version: 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-6.1.0-20-amd64-x86_64-with-glibc2.36
...

keeganq commented 4 months ago

I was able to reproduce this problem with a minimal example (sketched below). The root cause is that when add_negative_train_samples=False, negative sampling still occurs for the val and test splits.

https://github.com/pyg-team/pytorch_geometric/blob/d2f6ebac2dfde8d8d17ab8a5b94dba657a103aab/torch_geometric/transforms/random_link_split.py#L223-L232

Unfortunately, this not only adds negative edges to val_data and test_data, but also means that their positive edge labels are incremented by 1, whereas the train_data labels are unchanged. In your example, label 1 in val_data corresponds to label 0 in train_data, and so on; label 0 in val_data indicates a negative link.
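
Here is roughly the kind of minimal example I used (the graph, features, and labels are made up; only the shapes matter):

import torch
import torch_geometric.transforms as T
from torch_geometric.data import Data

# Toy graph: 4 nodes, 6 edges, 3 binary labels per edge (multi-hot).
edge_index = torch.tensor([[0, 1, 2, 3, 0, 2],
                           [1, 2, 3, 0, 2, 1]])
edge_label = torch.tensor([[1, 0, 1],
                           [0, 1, 0],
                           [1, 1, 0],
                           [0, 0, 1],
                           [1, 0, 0],
                           [0, 1, 1]])

data = Data(x=torch.randn(4, 8), edge_index=edge_index, edge_label=edge_label)

transform = T.RandomLinkSplit(num_val=0.2, num_test=0.2,
                              add_negative_train_samples=False)
train_data, val_data, test_data = transform(data)

# Train labels are untouched, but val/test positives are shifted by +1 and
# all-zero rows are appended for the sampled negative edges:
print(train_data.edge_label.unique())  # tensor([0, 1])
print(val_data.edge_label.unique())    # tensor([0, 1, 2])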

This seems like a very confusing kwarg, and possibly unintended behavior. I'd be happy to submit a PR to try to fix this.

keeganq commented 4 months ago

@sadrahkm A quick workaround is to pass the kwarg neg_sampling_ratio=0. to T.RandomLinkSplit. This will prevent negative sampling for the validation and test sets, and will also preserve the original labels in your dataset.
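
For example, the same transform as above with negative sampling disabled everywhere:

transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    neg_sampling_ratio=0.0,            # no negatives for val/test either
    add_negative_train_samples=False,
)
train_data, val_data, test_data = transform(data)
assert val_data.edge_label.max() <= 1  # labels are no longer shifted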

sadrahkm commented 4 months ago

Thank you @keeganq for your help

Right, I hadn't noticed that the add_negative_train_samples option only affects the training samples, while the validation/test sets still get negative samples automatically.

Yes, setting neg_sampling_ratio=0 fixes it. But I think this should be clarified in the documentation to avoid confusion like this.

sadrahkm commented 4 months ago

There is still a problem if we do want negative samples for the train/val/test sets: in that case we would have to set add_negative_train_samples=True together with a nonzero neg_sampling_ratio (e.g. 2.0), and then the val/test sets would again contain more than two label values, as I described in the problem statement.
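
For that case, a possible post-processing sketch (my own workaround, not an official API; it assumes multi-hot labels, where the sampled negatives are all-zero rows before the shift, and neg_edge_mask is a hypothetical attribute name):

def unshift_labels(split):
    label = split.edge_label.clone()
    # After the shift, every positive row contains at least one value >= 1,
    # while sampled negative rows are all zeros.
    pos_mask = (label > 0).any(dim=-1)
    label[pos_mask] -= 1               # undo the +1 shift on positives
    split.edge_label = label
    split.neg_edge_mask = ~pos_mask    # hypothetical: marks sampled negatives
    return split

val_data = unshift_labels(val_data)
test_data = unshift_labels(test_data)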