Open sadrahkm opened 5 months ago
I was able to reproduce this problem with a minimal example. The root cause is that when `add_negative_train_samples=False`, negative sampling still occurs for val and test examples.
Unfortunately, this not only adds negative edges to the `val_data` and `test_data`, but also means that their edge labels are incremented by 1, whereas the `train_data` labels are unchanged. In your example, label `1` in `val_data` corresponds to label `0` in `train_data`, and so on. Label `0` in `val_data` indicates a negative link.
This seems like a very confusing kwarg, and possibly an unintended result? I'd be happy to submit a PR to try to fix this.
@sadrahkm A quick workaround is to pass the kwarg `neg_sampling_ratio=0.` to `T.RandomLinkSplit`. This will prevent negative sampling for the validation and test sets, and will also preserve the original labels in your dataset.
Thank you @keeganq for your help.
Right, I hadn't noticed that the `add_negative_train_samples` option only applies to training samples, while the validation/test sets are automatically considered for negative sampling.
Yes, setting `neg_sampling_ratio=0` would fix it. But I think this should be clarified in the documentation to avoid confusion like this.
I think that if we want negative samples for the train/val/test sets, this issue would still be a problem: in that case, we would have to set `add_negative_train_samples=True` as well as `neg_sampling_ratio=2.0`. By doing this, val/test would have more than 2 labels, as I mentioned in the problem statement.
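If the default behavior is kept, the shifted val/test labels can at least be mapped back by hand. A sketch in plain PyTorch, where the tensor below is a made-up stand-in for `val_data.edge_label` (0 = sampled negative, 1/2 = shifted originals):

```python
import torch

# hypothetical shifted labels from a val split
val_edge_label = torch.tensor([0, 2, 1, 0, 2, 1])

# 1 = real edge, 0 = sampled negative
link_target = (val_edge_label > 0).long()
# undo the +1 shift on the positive edges to recover the original 0/1 labels
class_target = val_edge_label[val_edge_label > 0] - 1

print(link_target.tolist())   # [0, 1, 1, 0, 1, 1]
print(class_target.tolist())  # [1, 0, 1, 0]
```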
🐛 Describe the bug
Recently, I've been dealing with a multi-label edge classification problem. In other words, an edge can have more than one label. So I implemented a simple GNN model to see if I get good results or not.
I have 935 types of labels and have encoded them using the `MultiLabelBinarizer` method in sklearn. I have checked, and I'm sure that all the labels are 0 or 1.
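For reference, a toy version of that encoding step (the label sets below are made up; the real data has 935 classes):

```python
# Hypothetical mini-example of encoding per-edge label sets as 0/1 rows.
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
# one label set per edge
Y = mlb.fit_transform([['a', 'b'], ['b'], ['a', 'c']])

print(list(mlb.classes_))  # ['a', 'b', 'c']
print(Y.tolist())          # [[1, 1, 0], [0, 1, 0], [1, 0, 1]]
```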
But after splitting the edges using `RandomLinkSplit`, I noticed that there are more than two types of labels in the validation and test sets: the train set contains only 0 and 1, but the validation set contains 0, 1, and 2. This makes the work a little hard. In the following screenshot, I have shown this. The first cell is the original data, which is encoded with `MultiLabelBinarizer`. The next three cells are the train/val/test sets, respectively, split using the `RandomLinkSplit` call that I've provided in the code block.
For example, I want to compute the AUC score in the test step. I have attached the code and the errors that I got. I don't know why the edge splitter function returns more than two types of labels; I think it should only have 0 or 1. I would appreciate your help in this regard.
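For context, a sketch of the intended AUC computation once all labels are back to 0/1 (the `y_true` and `y_score` arrays below are made-up stand-ins for the real model output):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: one row per edge, one 0/1 column per label class (hypothetical)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# y_score: model scores of the same shape (hypothetical)
y_score = np.array([[0.9, 0.2, 0.8],
                    [0.1, 0.7, 0.3],
                    [0.8, 0.6, 0.4],
                    [0.3, 0.1, 0.9]])

# micro-averaging treats every (edge, class) pair as one binary decision,
# which also sidesteps classes that have no positive example in a batch
auc = roc_auc_score(y_true, y_score, average='micro')
print(auc)  # 1.0 here, since every positive outscores every negative
```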
Versions
Collecting environment information...
PyTorch version: 2.2.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 12 (bookworm) (x86_64)
GCC version: (Debian 12.2.0-14) 12.2.0
Clang version: Could not collect
CMake version: version 3.25.1
Libc version: glibc-2.36

Python version: 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] (64-bit runtime)
Python platform: Linux-6.1.0-20-amd64-x86_64-with-glibc2.36
...