pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Training on multiple graphs #2677

Open RostyslavUA opened 3 years ago

RostyslavUA commented 3 years ago

From now on, we recommend using our discussion forum (https://github.com/rusty1s/pytorch_geometric/discussions) for general questions.

❓ Questions & Help

I am developing a model for a node classification task. I batch multiple graphs into training and testing batches. After I train the model against one batch, I obtain some results that seem suspicious to me.

Let us say, my batch that contains the nodes for the training is given as follows: Batch(batch=[5811], edge_attr=[8340, 1], edge_index=[2, 8340], ptr=[11], test_mask=[5811], train_mask=[5811], val_mask=[5811], x=[5811, 40], y=[5811]) It contains 10 graphs as can be seen in ptr.
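
(For reference, the individual graphs can be recovered from the Batch object; a short sketch, assuming a torch_geometric.data.Batch named data_batched as in the code below:)

graphs = data_batched.to_data_list()  # the 10 original Data objects
# ptr holds cumulative node counts, so graph i owns nodes ptr[i]:ptr[i+1]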

Next I train the model:

train_epoch = 200
for i in range (len(data_batched.ptr)-1):
    loss_trained = np.zeros(train_epoch, dtype = float)
    optimizer = torch.optim.Adam(modelGraphConv.parameters(), lr= 0.01, weight_decay=5e-4)
    criterion = torch.nn.CrossEntropyLoss() 
    for epoch in range (1, train_epoch+1):
        loss = train(data_batched[i], modelGraphConv)
        loss_trained[epoch-1] = loss
    print(f'Epoch: {epoch:03d}, Loss: {loss:.4f}')
    print('=========================')
    plt.plot(np.arange(train_epoch), loss_trained)
    plt.show()

and the result of the first three trainings is depicted below

[three loss curves, one per training run]

Let us ignore the high loss for now. The thing that confuses me the most is that at the beginning of each training, the loss jumps back to a value of approximately 2. I would expect it to keep decreasing (or at least to remain at the same level), since the multiple graphs that I use for training come from the same simulation.

So the question is: am I making a mistake in the programming, or am I misunderstanding how the neural network should behave?

Thank you!

rusty1s commented 3 years ago

My guess is that this comes from re-initializing the optimizer in every training run. Can you try to move that call outside the first for loop?
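
Roughly like this (a sketch reusing the variable names from your snippet):

optimizer = torch.optim.Adam(modelGraphConv.parameters(), lr=0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()  # created once, shared across all graphs

for i in range(len(data_batched.ptr) - 1):
    for epoch in range(1, train_epoch + 1):
        loss = train(data_batched[i], modelGraphConv)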

RostyslavUA commented 3 years ago

Thank you for the answer! Even after I move the optimizer outside the first for loop, I get the following results

[three loss curves after moving the optimizer]

The loss still jumps back to a large value...

RostyslavUA commented 3 years ago

my training function is

def train(data, model): 
    model.train() 
    optimizer.zero_grad() 
    out = model(data.x, data.edge_index, data.edge_attr) 
    loss = criterion(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step() 
    return loss

and my model is

from torch_geometric.nn import TransformerConv

class Transf(torch.nn.Module):
    def __init__(self, data, hidden_channels):
        super(Transf, self).__init__()
        torch.manual_seed(12345) 
        self.conv1 = TransformerConv(data.num_features, hidden_channels, edge_dim=2) 
        self.conv2 = TransformerConv(hidden_channels, num_classes, edge_dim=2) 
    def forward(self, x, edge_index, edge_attr): 
        x = self.conv1(x, edge_index, edge_attr) 
        x = x.relu() 
        x = self.conv2(x, edge_index, edge_attr) 

        return x
rusty1s commented 3 years ago

I see. To me, this indicates that your network indeed just heavily overfits on one single graph and cannot transfer the knowledge to other graphs. What happens when you train your network with randomly sampled graphs from your training set, instead of one after another?

RostyslavUA commented 3 years ago

I randomly take 5 Data objects from the list

from random import sample
randomly_sampled_data = sample(data_list, 5)

And the result of the training is unfortunately the same

RostyslavUA commented 3 years ago

I thought that the overfitting problem was due to the fact that I use all of the nodes in the graph for training. So now I have set 25 % of the nodes in the graph for training. I also reduced the number of epochs to 50 and set the learning rate 5 to 10 times smaller than the previous one. Surprisingly, my curves look different: [three loss curves]

However, there is still something wrong, since it looks like the untrained model performs better than the trained one...
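
(For reference, a random 25 % training mask could be created roughly like this; a sketch, not necessarily the exact code used:)

num_nodes = data.num_nodes
perm = torch.randperm(num_nodes)
data.train_mask = torch.zeros(num_nodes, dtype=torch.bool)
data.train_mask[perm[:num_nodes // 4]] = True  # 25 % of the nodes for training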

rusty1s commented 3 years ago

I'm not sure I understand. What I mean is that currently you iterate over each graph, and train each graph in isolation. Instead, it's more reasonable to do something like this:

loader = DataLoader(data_list, batch_size=1, shuffle=True)

for epoch in range (1, train_epoch+1):
    for data in loader:
        loss = train(data, modelGraphConv)
RostyslavUA commented 3 years ago

Ah, I see! So let's say that data_list contains 10 graphs. What I did is the following:

from torch_geometric.data import DataLoader
loader = DataLoader(data_list, batch_size = 1, shuffle=True)

modelGraphConv = GraphConvClass(data, hidden_channels=16)
train_epoch = 200
loss_arr = np.zeros((len(data_list), train_epoch), dtype = float) 
optimizer = torch.optim.Adam(modelGraphConv.parameters(), lr= 0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range (1, train_epoch+1):
    k = 0
    for data in loader:
        loss = train(data, modelGraphConv)
        loss_arr[k, epoch-1] = loss 
        k +=1

for i in range (len(data_list)):
    plt.plot(np.arange(train_epoch), loss_arr[i, :])
plt.show()

So essentially the loss of each graph in the loader is saved along the rows of the matrix loss_arr. The operator in this particular case is GraphConv. Note that here I use a 1-dimensional edge attribute. The result is as follows

[loss curves for GraphConv, one per graph]

In parallel, I also train the TransformerConv model, where I use the same number of epochs and the same optimizer settings (learning rate, weight decay), but a 2-dimensional edge attribute. The result is shown below. [loss curves for TransformerConv]

The loss goes down much better, but how could we interpret this?

rusty1s commented 3 years ago

Yes, this looks more reasonable. What do you mean by how one can interpret this?

RostyslavUA commented 3 years ago

Sorry, I forgot to mention this. So what we have done now is avoid overfitting on one particular graph, so that the model can make better predictions over the entire dataset. My problem is that even when I manage to bring the loss down, the accuracy remains very low. And this is reasonable, since my embedding space does not look well-separated. Let me clarify this further with the example of TransformerConv, which is defined as follows

class Transf(torch.nn.Module):
    def __init__(self, data, hidden_channels):
        super(Transf, self).__init__()
        torch.manual_seed(12345) 
        self.conv1 = TransformerConv(data.num_features, hidden_channels, edge_dim=2) 
        self.conv2 = TransformerConv(hidden_channels, hidden_channels, edge_dim=2) 
        self.conv3 = TransformerConv(hidden_channels, num_classes, edge_dim=2) 
    def forward(self, x, edge_index, edge_attr): 
        x = self.conv1(x, edge_index, edge_attr) 
        x = x.relu() 
        x = self.conv2(x, edge_index, edge_attr) 
        x = x.relu()
        x = self.conv3(x, edge_index, edge_attr)

        return x

Note that I have added one additional layer in comparison to the models that I mentioned earlier. Then I instantiate the class and visualize the embeddings

modelTransf = Transf(data, hidden_channels=64)
outTransf = modelTransf(data.x, data.edge_index, data.edge_attr)

visualize(outTransf, color=data.y)
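
(visualize is not defined in this thread; presumably it is the T-SNE helper from the PyG node classification tutorial, along these lines:)

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def visualize(h, color):
    # Project node embeddings to 2D with T-SNE and color the points by class label.
    z = TSNE(n_components=2).fit_transform(h.detach().cpu().numpy())
    plt.figure(figsize=(10, 10))
    plt.scatter(z[:, 0], z[:, 1], s=70, c=color, cmap='Set2')
    plt.show()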

Before training, my embedding looks like this: [embedding plot before training]

Then, with the following parameters, I manage to bring the loss down through training

modelTransf = Transf(data, hidden_channels=64)
train_epoch = 200
loss_arr = np.zeros((len(data_list), train_epoch), dtype = float)
optimizer = torch.optim.Adam(modelTransf.parameters(), lr= 0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss() 

for epoch in range (1, train_epoch+1):
    k = 0
    for data in loader:
        loss = train(data, modelTransf)
        loss_arr[k, epoch-1] = loss
        k +=1

for i in range (len(data_list)):
    plt.plot(np.arange(train_epoch), loss_arr[i, :])
plt.show()

which looks like this: [loss curves]

However, when I test the accuracy

accuracy, outTransf = test(data_test, modelTransf)[0:2]
print(f'Accuracy is {accuracy:.3f} ')
visualize(outTransf, color = data_test.y)

it remains in the range of 20 %, which is very low. The reason can be seen in the embeddings after training, which are depicted below. [embedding plot after training]

From the figure above we can see that the nodes belonging to the same class are not well separated. Note that the accuracy always remains in the vicinity of 20 % and is not impacted by the value of the loss, meaning that for a loss of 0.4 and of 1.0 the accuracy remains similar. This also tells me that even when we avoid overfitting on one particular graph, the model cannot generalize to the whole dataset. I have tried to solve this problem by:

Thank you!

rusty1s commented 3 years ago

I'm not sure whether this is a problem in your model (at least your code looks correct to me). What does the training accuracy look like? I'm not sure I can give you any advice, since I do not know your data. This might also be a pre-processing problem, in which nodes have false connections.

RostyslavUA commented 3 years ago

I have modified my code a little bit, so that now I can check the accuracy of the training and testing:

def train(loader, model): 
    model.train() 
    for data in loader:
        optimizer.zero_grad() 
        out = model(data.x, data.edge_index, data.edge_attr) 
        loss = criterion(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()

def test(loader, model):
    model.eval()
    total_nodes = 0
    correct = 0
    for data in loader:
        out = model(data.x, data.edge_index, data.edge_attr)
        pred = out.argmax(dim=1)   
        correct += int((pred == data.y).sum())
        total_nodes += data.num_nodes
    return correct/total_nodes

and now when I train and test

modelTransf = Transf(data, hidden_channels=64)
train_epoch = 200
train_acc = []
test_acc = []
optimizer = torch.optim.Adam(modelTransf.parameters(), lr= 0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss() 

for epoch in range (1, train_epoch+1):
    train(loader, modelTransf)
    train_acc.append(test(loader, modelTransf))
    test_acc.append(test(loader_test, modelTransf))

plt.plot(np.arange(train_epoch), np.array(train_acc), label='Training')
plt.plot(np.arange(train_epoch), np.array(test_acc), label='Test')
plt.legend()
plt.show()

my result for 15 graphs for training and 15 graphs for testing is shown below [accuracy curves]

when I set 75 graphs for training and 23 graphs for testing, the resulting curves look like in the following picture [accuracy curves]

Regarding the false connections of the nodes, this does not seem to be the problem. I verify it in the following way

data = data_list[0] # select one graph from the list
data_net = to_networkx(data) 
[n for n in data_net[0]]

which returns the neighbors of a particular node. I then compare it with my original dataset and it matches. Therefore I think that the nodes have the correct connections. Or did you mean something different?

Thank you!

rusty1s commented 3 years ago

This is interesting, the model does not seem to be able to generalize at all :( I sadly don't have any good advice on this one. What happens when you drop edge_index completely, e.g., replacing all GNN layers with PyTorch Linear layers? Does that increase test accuracy?

RostyslavUA commented 3 years ago

I did it in the following way: Modify training and testing functions

def train(loader, model): 
    model.train() 
    for data in loader:
        optimizer.zero_grad() 
        out = model(data.x) # drop edges
        loss = criterion(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()

def test(loader, model):
    model.eval()
    total_nodes = 0
    correct = 0
    for data in loader:
        out = model(data.x) # drop edges
        pred = out.argmax(dim=1)   
        correct += int((pred == data.y).sum())
        total_nodes += data.num_nodes
    return correct/total_nodes

Create the model

from torch.nn import Linear

class Lin(torch.nn.Module):
    def __init__(self, data, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.lin1 = Linear(data.num_features, hidden_channels)
        self.lin2 = Linear(hidden_channels, num_classes)
    def forward(self, x):
        x = self.lin1(x)
        x = x.relu()
        x = self.lin2(x)
        return x

then train

modelLin = Lin(data, hidden_channels = 64)
train_epoch = 200
train_acc = []
test_acc = []
optimizer = torch.optim.Adam(modelLin.parameters(), lr= 0.01, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss() 

for epoch in range (1, train_epoch+1):
    train(loader, modelLin)
    train_acc.append(test(loader, modelLin))
    print(f'epoch {epoch}')
    print('Training is done!')
    test_acc.append(test(loader_test, modelLin))
    print('Tesing is done!')
    print(f'Train accuracy is {train_acc[epoch-1]:.2f} and Test accuracy is {test_acc[epoch-1]:.2f}')
    print('=====')
plt.plot(np.arange(train_epoch), np.array(train_acc), label='Training')
plt.plot(np.arange(train_epoch), np.array(test_acc), label='Test')
plt.legend()
plt.show()

and the model still cannot generalize ;( [accuracy curves]

What could be the reason? Maybe I need a richer feature vector for each node? By the way, my Data object looks like this:

Data(edge_attr=[822, 2], edge_index=[2, 822], test_mask=[563], train_mask=[563], val_mask=[563], x=[563, 70], y=[563])

and the node feature is one-hot encoded: only two entries take the value 1 and all others are set to 0.

rusty1s commented 3 years ago

It looks like test set features are always out-of-distribution. Mh, you can plot your node features via T-SNE to confirm. Maybe this gives you some more intuition on what's going wrong. Otherwise, I'm out of ideas :(

RostyslavUA commented 3 years ago

All right, so I have collapsed my 40-dimensional feature vector to 2D via T-SNE for one training graph and one test graph

from sklearn.manifold import TSNE

data_emb = TSNE(n_components=2).fit_transform(data.x)
data_test_emb = TSNE(n_components=2).fit_transform(data_test.x)

and this is the result: [T-SNE plot of train and test node features]

If I can interpret the result correctly, it looks like the test set is not out-of-distribution, since both test and train data are distributed in a very similar manner.

Essentially, my node feature vector represents coordinates in 2D, so before one-hot encoding (before bringing it up to 40 dimensions), it looks like this: [scatter plot of the 2D node coordinates]

In other words, the picture above depicts the collected data.

At this moment, I cannot think of anything else I could try with the current dataset. If you have any other ideas, from looking at my datasets, on how to improve the accuracy of the model or how to process/modify/analyze the data, I would greatly appreciate it if you shared them. Otherwise, thank you very much for helping me, and we can close the thread.

rusty1s commented 3 years ago

Any reason why you convert your coordinates to a one-hot-encoding? If your graph is "spatial", then you can treat it as such in the GNN layer as well, e.g., by using SplineConv.

RostyslavUA commented 3 years ago

From my observations, with one-hot encoding of the node features I am able to bring the loss down. Usually, the higher the dimensionality of the one-hot encoded feature, the steeper the loss decrease. The accuracy does not grow, though.

Thanks for the advice, SplineConv seems more reasonable in this case. But the problem still remains: even though the training accuracy grows by a few percent, the test accuracy does not. Here is my code:

from torch_geometric.nn import SplineConv

class Spline(torch.nn.Module):
    def __init__(self, data, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.spl1 = SplineConv(data.num_features, hidden_channels, dim = 1, kernel_size = 2)
        self.spl2 = SplineConv(hidden_channels, num_classes, dim = 1, kernel_size = 2)
    def forward(self, x, edge_index, edge_attr):
        x = self.spl1(x, edge_index, edge_attr[:, 1][np.newaxis].T)
        x = x.relu()
        x = self.spl2(x, edge_index, edge_attr[:, 1][np.newaxis].T)
        return x

For the pseudo-coordinates, I use the one-dimensional edge attribute (normalized to values between 0 and 1) that in the collected data represents the distance between the nodes.
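
(A minimal sketch of such a normalization, assuming the distance sits in the second column of edge_attr as in the forward pass above:)

d = data.edge_attr[:, 1]
data.edge_attr[:, 1] = (d - d.min()) / (d.max() - d.min())  # min-max scale to [0, 1]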

train_epoch = 200
train_acc = []
test_acc = []
modelSpl = Spline(data, hidden_channels = 16)
optimizer = torch.optim.Adam(modelSpl.parameters(), lr = 0.01, weight_decay = 5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(1, train_epoch+1):
    train(loader, modelSpl)
    train_acc.append(test(loader, modelSpl))
    test_acc.append(test(loader_test, modelSpl))
plt.plot(np.arange(train_epoch), np.array(train_acc), label='Training')
plt.plot(np.arange(train_epoch), np.array(test_acc), label='Test')
plt.legend()
plt.show()

and the resulting accuracy plot is very similar. [accuracy curves]

Meanwhile, with kernel_size = 3 the loss remains at approx. 1.6 throughout the entire training.

I have also noticed that if I set kernel_size to some large value, e.g. 100, then my loss goes down to approx. 0.9 and the training accuracy grows to 40 %. The test accuracy remains at 20 %. In addition, I have also tried to use a 2-dimensional edge_attr, to change the number of graphs in the loader (from 15 to 75 in the training set and 5 to 15 in the test set), and to use different numbers of hidden channels, and the result does not change much.

Do you know anything else I could try? Thank you!

rusty1s commented 3 years ago

For edge_attr, you can directly make use of the 2D coordinates, and for features, you could try to go with just a single feature holding a 1 (similar to what we do in FAUST). This will drop absolute coordinate information, which might help the model to generalize better.
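
(A possible setup along those lines; a sketch that assumes the 2D node coordinates are available as a [num_nodes, 2] tensor data.pos:)

data.x = torch.ones((data.num_nodes, 1))                 # single constant feature per node
src, dst = data.edge_index
data.edge_attr = torch.cat([data.pos[src], data.pos[dst]], dim=-1)  # [num_edges, 4]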

RostyslavUA commented 3 years ago

For some strange reason, my kernel dies when I set kernel_size = 2 after I put the 2D coordinates into edge_attr. For kernel_size = 1 the kernel does not die; however, the accuracy is not improved and the loss is approximately 1.6. Meanwhile, with TransformerConv the kernel does not die (but again no improvement in accuracy). Let me explain it in more detail.

Now my edge_attr looks like

tensor([[ 0.5662,  0.3955,  0.5494,  0.4265],
        [-0.8367, -0.4174, -0.8214, -0.4449],
        [-0.1971,  0.6470, -0.2018,  0.6349],
        ...,
        [ 0.7983,  0.7754,  0.7710,  0.7770],
        [-0.0177,  0.9949, -0.0338,  1.0000],
        [-0.0150,  0.9933, -0.0338,  1.0000]])

where the first 2 columns represent the coordinates of the source node and the last 2 columns the coordinates of the destination node. The node feature is a 1 for all nodes. My model is the same

from torch_geometric.nn import SplineConv

class Spline(torch.nn.Module):
    def __init__(self, data, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.spl1 = SplineConv(data.num_features, hidden_channels, dim = 4, kernel_size = 2)
        self.spl2 = SplineConv(hidden_channels, num_classes, dim = 4, kernel_size = 2)
    def forward(self, x, edge_index, edge_attr):
        x = self.spl1(x, edge_index, edge_attr)
        x = x.relu()
        x = self.spl2(x, edge_index, edge_attr)
        return x

When I start training

train_epoch = 200
train_acc = []
test_acc = []
modelSpl = Spline(data, hidden_channels = 16)
optimizer = torch.optim.Adam(modelSpl.parameters(), lr = 0.01, weight_decay = 5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(1, train_epoch+1):
    train(loader, modelSpl)
    train_acc.append(test(loader, modelSpl))
    test_acc.append(test(loader_test, modelSpl))
plt.plot(np.arange(train_epoch), np.array(train_acc), label='Training')
plt.plot(np.arange(train_epoch), np.array(test_acc), label='Test')
plt.legend()
plt.show()

my loss either jumps to inf or becomes nan and the kernel dies (again, for kernel_size = 1, the loss is 1.6 and training keeps going). Note that I receive the following warning when I create the model

UserWarning: We do not recommend using the non-optimized CPU version of `SplineConv`. If possible, please move your data to GPU.
  warnings.warn(

but my GPU has no CUDA support, so I have to go with the CPU anyway.

Such an unexpected problem :) Do you know what could cause this? Thank you very much!

rusty1s commented 3 years ago

The SplineConv expects edge features to be in the interval [0, 1], and I think this may cause the issue. Instead of inputting absolute coordinates as edge features, the idea in SplineConv is to input relative coordinates. You can do this via the T.Cartesian() transform, e.g.:

import torch_geometric.transforms as T

data.pos = ...  # node positions
data = T.Cartesian()(data)  # writes relative coordinates into data.edge_attr
conv = SplineConv(1, hidden_channels, dim=2, kernel_size=5)

It's a bummer that you do not have access to a GPU, since the SplineConv is actually quite slow on CPU :(
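
(Applied to a whole dataset, that could look roughly like this, assuming each Data object in data_list has pos set:)

import torch_geometric.transforms as T

transform = T.Cartesian()  # relative edge coordinates, normalized to [0, 1] by default
data_list = [transform(d) for d in data_list]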

RostyslavUA commented 3 years ago

All right, even though I did not increase my test accuracy yet, there is an interesting observation that I made. Generally speaking, I want to reduce the number of conflicting nodes (nodes that have the same class and are connected by an edge).

Once again, my Data object is as follows

Data(edge_attr=[840, 2], edge_index=[2, 840], pos=[596, 2], test_mask=[596], train_mask=[596], val_mask=[596], x=[596, 1], y=[596])

where edge_attr contains the relative coordinates of the nodes, calculated as you mentioned earlier. x contains a 1 for each node.

Model

class Spline(torch.nn.Module):
    def __init__(self, data, hidden_channels):
        super().__init__()
        torch.manual_seed(12345)
        self.spl1 = SplineConv(data.num_features, hidden_channels, dim = 2, kernel_size = 5)
        self.spl2 = SplineConv(hidden_channels, num_classes, dim = 2, kernel_size = 5)
    def forward(self, x, edge_index, edge_attr):
        x = self.spl1(x, edge_index, edge_attr)
        x = x.relu()
        x = self.spl2(x, edge_index, edge_attr)
        return x

Training itself

train_epoch = 200
train_acc = []
test_acc = []
test_conf_acc = []
modelSpl = Spline(data, hidden_channels = 16)
optimizer = torch.optim.Adam(modelSpl.parameters(), lr = 0.01, weight_decay = 5e-4)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(1, train_epoch+1):
    train(loader, modelSpl)
    train_acc.append(test(loader, modelSpl))
    test_acc.append(test(loader_test, modelSpl))
    test_conf_acc.append(test_only_conf(loader_test, modelSpl)) # accuracy w.r.t. conflicting nodes

The test accuracy of the model with the SplineConv operator does not grow (maybe there is indeed some error in how I preprocess the data, I am not sure), so I have decided to check whether the model does what I want: reduce the number of conflicting nodes.

I do it in the following way

def test_only_conf(loader, model):
    model.eval()
    total_edges = 0
    correct = 0
    clash = 0
    for data in loader:
        total_edges += data.num_edges
        out = model(data.x, data.edge_index, data.edge_attr)
        pred = out.argmax(dim=1)
        data_net = to_networkx(data)
        for j in range(len(data.x)): 
            neighb = [n for n in data_net[j]] # get the neighbors
            node_pred = pred[j] # get the prediction of the node of interest
            for m in range(len(neighb)):
                if node_pred == pred[neighb[m]]: # check if the classes of the selected node and its neighbors' are the same
                    clash +=1 
    return 1-clash/(2*total_edges) # returns accuracy. 2 is due to visiting each node twice
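
(As a side note, the same conflict count could also be computed directly on edge_index, without the networkx round-trip; a sketch, assuming edge_index stores both edge directions:)

def count_conflicts(pred, edge_index):
    # Two endpoints of an edge conflict if they get the same predicted class.
    src, dst = edge_index
    return int((pred[src] == pred[dst]).sum())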

and the result I obtain is

[accuracy curves: Training, Test, Test-conflict]

where the accuracy w.r.t. conflicts is labeled Test-conflict.

From the figure above we can observe that at the beginning of the training the model performs with a Test-conflict accuracy of 65 %, which then grows to 90 %. So this is essentially what I need! However, the predictions of the model do not match the labels, and I believe that this is still a problem. If we look at the embeddings of one of the graphs that I used for training, they look like this: [embedding plot]

where I cannot see any similarity (e.g. in distance) between the nodes that belong to the same class ;( This also indicates that the increase in Test-conflict accuracy is rather something inherent to the SplineConv operator than something induced by my dataset.

What do you think about it?

rusty1s commented 3 years ago

Not sure I fully understand it yet. So equally labeled nodes have the same label, but their label does not match the ground-truth label? I wonder whether there is any indication in the dataset of which nodes belong to which label?

RostyslavUA commented 3 years ago

So equally labeled nodes have the same label, but their label does not match the ground-truth label?

No, equally labeled nodes do not necessarily match the ground-truth label; however, during training, the number of conflicting nodes (node pairs that have the same class) reduces.

This comes from the fact that the predictions at the beginning of the training are less diverse than towards the end of the training. Let me give you an example: at the 1st epoch, my predictions are

Predictions tensor([4, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 3, 3, 3,
        3, 3, 3, 4, 4, 3, 4, 3, 4, 3, 4, 3, 3, 4, 3, 3, 4, 3, 4, 3, 3, 3, 4, 4,
        3, 4, 3, 4, 3, 4, 4, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3,
        3, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 4,
        4, 4, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 4, 4, 4, 4, 4, 4, 3,
        3, 4, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3, 4, 3, 4, 3, 3, 3, 3, 3,
        3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3,
        4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 4, 3, 3, 3, 3, 3, 4, 4, 4,
        3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3,
        4, 3, 4, 3, 4, 4, 4, 4, 3, 3, 3, 3, 4, 4, 3, 3, 3, 4, 4, 4, 4, 3, 3, 4,
        3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3,
        3, 3, 3, 4, 4, 3, 4, 3, 4, 3, 4, 3, 3, 4, 4, 3, 4, 4, 4, 3, 4, 3, 4, 4,
        4, 4, 4, 3, 4, 3, 4, 3, 4, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 4, 4, 3, 3, 3,
        4, 3, 3, 3, 4, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3,
        3, 4, 3, 3, 3, 3, 4, 3, 4, 4, 3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 3, 3, 4, 3,
        3, 3, 3, 3, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4, 4,
        4, 3, 3, 4, 3, 3, 4, 4, 3, 3, 3, 3, 3, 4, 3, 3, 4, 4, 3, 3, 3, 4, 3, 4,
        3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 4, 3, 3,
        3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3,
        4, 3, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 4, 3, 4, 3, 4, 3, 3, 4, 3, 4, 3, 4,
        4, 4, 4, 3, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 4, 3, 4,
        4, 3, 4, 3, 4, 4, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 4, 3, 3, 3, 3, 4, 4,
        4])

at the 10th epoch

Predictions tensor([4, 4, 0, 0, 4, 4, 4, 2, 4, 4, 4, 3, 4, 4, 1, 1, 0, 3, 4, 1, 3, 4, 0, 0,
        0, 3, 1, 1, 4, 1, 4, 0, 4, 1, 0, 0, 0, 0, 0, 0, 1, 3, 4, 1, 2, 0, 4, 4,
        0, 4, 4, 4, 0, 4, 4, 0, 4, 2, 4, 3, 4, 2, 1, 2, 1, 0, 4, 4, 4, 0, 3, 1,
        1, 1, 2, 4, 4, 3, 1, 2, 2, 4, 0, 0, 0, 4, 2, 1, 3, 3, 1, 2, 0, 4, 0, 1,
        4, 4, 4, 3, 1, 0, 2, 0, 0, 1, 3, 1, 4, 4, 3, 4, 3, 0, 0, 2, 4, 4, 2, 4,
        3, 1, 0, 1, 0, 1, 3, 1, 1, 2, 4, 1, 0, 4, 1, 1, 4, 1, 4, 1, 3, 2, 1, 4,
        3, 4, 4, 1, 1, 4, 3, 0, 3, 3, 4, 0, 3, 0, 2, 0, 3, 2, 3, 4, 1, 4, 3, 0,
        4, 3, 1, 0, 4, 3, 2, 1, 3, 4, 4, 3, 4, 1, 1, 4, 0, 3, 0, 1, 3, 4, 4, 4,
        3, 3, 3, 4, 4, 0, 1, 0, 0, 3, 1, 4, 1, 4, 4, 1, 3, 3, 4, 0, 1, 0, 1, 1,
        4, 0, 1, 0, 4, 4, 4, 4, 0, 1, 3, 1, 4, 4, 1, 0, 0, 4, 4, 3, 0, 2, 4, 1,
        1, 4, 1, 4, 1, 0, 3, 1, 0, 2, 3, 4, 0, 2, 4, 4, 2, 0, 0, 2, 1, 1, 0, 1,
        4, 1, 2, 4, 4, 1, 4, 2, 4, 1, 4, 0, 1, 1, 4, 1, 4, 1, 4, 3, 4, 4, 1, 4,
        4, 4, 1, 3, 4, 2, 4, 3, 0, 2, 0, 4, 3, 0, 4, 1, 3, 0, 4, 4, 1, 4, 0, 3,
        1, 0, 1, 0, 4, 4, 1, 2, 1, 4, 2, 1, 3, 4, 1, 2, 0, 0, 2, 3, 1, 4, 3, 2,
        0, 1, 3, 3, 1, 4, 4, 3, 4, 1, 3, 3, 4, 4, 1, 1, 0, 1, 1, 1, 0, 1, 4, 0,
        2, 4, 1, 3, 4, 4, 0, 3, 4, 3, 1, 4, 0, 0, 1, 4, 4, 4, 1, 0, 3, 4, 1, 4,
        4, 0, 2, 1, 1, 0, 4, 1, 0, 3, 4, 1, 0, 4, 4, 0, 4, 4, 3, 0, 4, 1, 4, 4,
        4, 4, 0, 3, 3, 0, 0, 0, 1, 4, 1, 4, 0, 4, 0, 4, 4, 3, 3, 4, 3, 4, 1, 3,
        2, 3, 4, 3, 0, 3, 1, 0, 1, 1, 4, 4, 0, 3, 3, 3, 1, 3, 3, 0, 4, 3, 0, 4,
        4, 1, 3, 4, 1, 0, 4, 1, 3, 2, 4, 4, 4, 0, 4, 0, 4, 1, 4, 4, 3, 4, 3, 1,
        4, 3, 1, 3, 1, 1, 1, 4, 1, 1, 1, 0, 3, 1, 0, 4, 4, 0, 2, 3, 4, 4, 0, 4,
        4, 1, 2, 1, 4, 1, 3, 3, 4, 0, 4, 4, 4, 2, 3, 1, 2, 1, 3, 3, 4, 0, 2, 4,
        4])

at the 100th epoch

Predictions tensor([4, 1, 1, 4, 3, 4, 4, 2, 4, 1, 4, 0, 1, 4, 1, 1, 0, 3, 4, 4, 3, 4, 0, 0,
        0, 0, 3, 1, 3, 1, 2, 0, 4, 3, 4, 0, 0, 0, 1, 0, 1, 3, 4, 1, 2, 0, 2, 4,
        0, 4, 4, 3, 0, 0, 4, 2, 1, 4, 3, 3, 4, 2, 1, 2, 1, 0, 3, 4, 4, 0, 3, 3,
        1, 0, 2, 3, 4, 4, 1, 2, 2, 1, 0, 2, 0, 3, 2, 1, 3, 3, 1, 2, 0, 3, 2, 1,
        4, 4, 4, 3, 1, 0, 2, 0, 1, 1, 3, 1, 0, 4, 0, 4, 4, 1, 0, 2, 4, 1, 2, 4,
        3, 1, 0, 1, 3, 1, 3, 1, 1, 2, 4, 1, 0, 0, 3, 1, 0, 1, 1, 1, 3, 2, 1, 4,
        3, 4, 0, 1, 0, 1, 3, 0, 0, 0, 4, 0, 3, 0, 2, 0, 3, 2, 3, 4, 1, 3, 0, 0,
        4, 3, 1, 0, 4, 3, 2, 1, 3, 4, 4, 2, 4, 1, 1, 1, 0, 1, 0, 1, 3, 4, 4, 4,
        3, 3, 1, 3, 3, 0, 1, 0, 0, 0, 3, 4, 1, 1, 4, 3, 1, 1, 2, 0, 2, 0, 1, 1,
        4, 2, 1, 1, 4, 3, 1, 3, 0, 1, 0, 1, 4, 4, 1, 0, 0, 4, 4, 3, 0, 3, 4, 1,
        1, 4, 2, 0, 1, 0, 3, 1, 0, 2, 3, 1, 0, 2, 4, 0, 2, 1, 0, 2, 1, 1, 0, 2,
        4, 1, 2, 1, 3, 1, 4, 2, 2, 1, 4, 2, 1, 1, 4, 2, 4, 1, 3, 0, 1, 4, 1, 4,
        4, 3, 1, 3, 4, 2, 3, 3, 0, 2, 0, 1, 2, 0, 4, 1, 3, 3, 2, 0, 1, 4, 0, 3,
        1, 0, 1, 0, 4, 4, 1, 2, 4, 2, 2, 1, 3, 4, 1, 2, 3, 0, 2, 4, 1, 4, 3, 2,
        3, 1, 4, 3, 0, 2, 4, 3, 1, 1, 2, 3, 4, 3, 1, 2, 2, 1, 1, 1, 0, 2, 4, 0,
        2, 4, 3, 3, 3, 4, 0, 3, 1, 3, 3, 0, 0, 0, 1, 1, 0, 4, 0, 3, 0, 3, 4, 4,
        4, 0, 2, 1, 1, 0, 4, 1, 2, 1, 4, 0, 0, 4, 0, 0, 4, 3, 1, 0, 4, 1, 4, 4,
        4, 4, 0, 3, 3, 0, 0, 1, 1, 4, 1, 4, 0, 2, 2, 4, 4, 3, 3, 4, 3, 4, 3, 3,
        1, 3, 0, 3, 0, 4, 1, 0, 1, 1, 4, 1, 0, 3, 3, 3, 1, 2, 3, 1, 4, 3, 0, 3,
        1, 1, 0, 2, 1, 0, 4, 1, 3, 3, 3, 4, 4, 2, 4, 1, 0, 1, 2, 4, 3, 4, 3, 1,
        4, 3, 1, 3, 1, 1, 1, 4, 1, 1, 1, 1, 3, 3, 0, 4, 0, 0, 2, 4, 1, 4, 0, 4,
        3, 1, 2, 0, 1, 4, 3, 3, 2, 0, 1, 3, 4, 4, 0, 1, 2, 1, 3, 3, 4, 0, 2, 4,
        4])

As we can see, at the beginning the predictions are essentially 3 and 4, and the more we train, the more diverse the predictions become. This explains why we have more conflicting nodes at the beginning and fewer conflicting nodes at the end of the training.

Meanwhile, the ground-truth label has some value between 0 and 4 for each node.

I wonder whether there is any indication in the dataset of which nodes belong to which label?

Yes, there are ground-truth labels in my dataset. The loss and the accuracy are calculated w.r.t. them.

Let me repeat: the objective is to obtain the least number of conflicting nodes (that is what the Test-conflict accuracy is about). At the same time, there is a crucial point that I have been overlooking until now: there are multiple possible solutions, and my ground-truth labels indicate one of the (sub)optimal solutions. So maybe the predictions of the model are not wrong - they just don't match the ground-truth labels! That is maybe the reason why the usual accuracy always stays at 20 %.

The above idea contradicts the following one: I believe that the loss and the usual accuracy must still improve, since I indicate with the ground-truth labels what the solution is. This is indeed the solution the model should approach as closely as possible.

Given the last two paragraphs, can you say which view is right? And in general, which operators would help me find the optimal solution?

Thanks a lot!

rusty1s commented 3 years ago

I see. I don't think a classic node classification criterion is ideal in this scenario, as you correctly pointed out that your labels are only one valid solution. I would frame this problem as contrastive learning/link prediction, that is, you want nodes with the same label to have high probability, and nodes with different labels to have low probability.

RostyslavUA commented 3 years ago

you want nodes with the same label to have high probability, and nodes with different labels to have low probability.

I guess it is the other way around :) nodes with the same label must have low probability and nodes with different labels high probability. I may be wrong about it, but I still think it is a different problem from edge prediction: in the literature my problem is usually referred to as the graph-coloring problem (reduce the number of adjacent nodes that share the same color).

Thus, node classification still makes more sense to me. Also, at the input I am already given a graph; in other words, the edges between the nodes are already predefined.

There are multiple correct ways to "color" the graph (to classify the nodes) and my dataset indicates one out of many solutions. My solution (the labels) is likely optimal and I want to train the model against it.

Eventually, I want to feed a non-colored graph into the trained model and get the colored graph at the output.

Or did I just not get your idea of edge prediction correctly?

rusty1s commented 3 years ago

You are right that you can cast this problem as a node classification problem, but as you correctly identified, the ground-truth information is only a single solution to your problem. In the end, the machine learning model cannot really learn the correct "color" of nodes, as it is arbitrary. As such, you need to train your model independently of the specific ground-truth color. The best way of doing so is via a contrastive loss, i.e., separate the nodes with the same label from all other nodes. This is equivalent to doing link prediction, i.e., finding the nodes that are (dis-)connected.
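
A rough sketch of that pattern (all names here are placeholders, and which pairs count as positive or negative depends on how the task is framed):

import torch
import torch.nn.functional as F
from torch_geometric.utils import negative_sampling

z = encoder(data.x, data.edge_index)                       # node embeddings from any GNN encoder
pos_edge_index = data.edge_index                           # pairs to pull together
neg_edge_index = negative_sampling(data.edge_index, num_nodes=data.num_nodes)  # pairs to push apart

pos_logits = (z[pos_edge_index[0]] * z[pos_edge_index[1]]).sum(dim=-1)
neg_logits = (z[neg_edge_index[0]] * z[neg_edge_index[1]]).sum(dim=-1)

loss = F.binary_cross_entropy_with_logits(
    torch.cat([pos_logits, neg_logits]),
    torch.cat([torch.ones_like(pos_logits), torch.zeros_like(neg_logits)]))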

RostyslavUA commented 3 years ago

Okay! I am already working on this. We can close the thread now.

Thank you very much for your help!

RostyslavUA commented 3 years ago

As you suggested, I have reformulated the problem as link prediction and the test accuracy seems to be higher. However, I am not sure how to correctly extract the predicted links. It confuses me that the model predicts many more edges than are actually in the Data object, while the accuracy is still high.

In particular, the data object at the beginning has 10556 edges: Data(edge_index=[2, 10556], test_mask=[2708], train_mask=[2708], val_mask=[2708], x=[2708, 1433], y=[2708])

then we split the edges

data = train_test_split_edges(data)

and add negative edges by means of negative_sampling.
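
(For context, model below presumably follows the encode/decode_all pattern from PyG's link prediction example; a minimal sketch of such a model, not necessarily the exact one used here:)

import torch
from torch_geometric.nn import GCNConv

class Net(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def encode(self, x, edge_index):
        x = self.conv1(x, edge_index).relu()
        return self.conv2(x, edge_index)

    def decode(self, z, pos_edge_index, neg_edge_index):
        edge_index = torch.cat([pos_edge_index, neg_edge_index], dim=-1)
        return (z[edge_index[0]] * z[edge_index[1]]).sum(dim=-1)

    def decode_all(self, z):
        prob_adj = z @ z.t()  # dense score matrix; logit 0 corresponds to probability 0.5
        return (prob_adj > 0).nonzero(as_tuple=False).t()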

Now, the training/testing/validation performance is shown below

[training/validation/test performance curves]

Since the test accuracy is very high, I expect the number and connectivity of the predicted links to be very similar to those of the Data object given as input.

To extract those edges, I do

z = model.encode(data.x, data.train_pos_edge_index)
final_edge_index = model.decode_all(z)

and when I check the number of predicted edges

print(final_edge_index.size(1))

I get 3465922, which is much larger than the number of edges in the input graph. I do not understand how, with such a large difference in the number of edges, the accuracy remains at 90 %. So my question is: how do I correctly extract the predicted edges? Or, if these predictions are correct, how do I make the model predict only the missing edges?

Thank you!

rusty1s commented 3 years ago

The final_edge_index will use a threshold of 0.5 to decide whether to include an edge or not, which might be too low in your experiment. To only keep edges with higher probability, run:

prob_adj = (z @ z.t()).sigmoid()  # matrix of edge probabilities
return (prob_adj > threshold).nonzero(as_tuple=False).t()
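
For example, as a drop-in variant of decode_all with a stricter cut-off (the 0.9 here is an arbitrary choice):

def decode_all(self, z, threshold=0.9):
    prob_adj = (z @ z.t()).sigmoid()  # edge probabilities
    return (prob_adj > threshold).nonzero(as_tuple=False).t()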