phlippe / uvadlc_notebooks

Repository of Jupyter notebook tutorials for teaching the Deep Learning Course at the University of Amsterdam (MSc AI), Fall 2023
https://uvadlc-notebooks.readthedocs.io/en/latest/
MIT License

Tutorial 7 single head GAT layer produces same output for nodes with identical neighbours #142

Closed: E1k3 closed this issue 6 months ago

E1k3 commented 6 months ago

Tutorial: 7 (pytorch)

Describe the bug The GAT layer is introduced as superior to plain graph convolutions because, by default, it does not treat nodes with identical neighbours identically, but when I set num_heads=1 in the provided short example, that is not the case.

This does not make sense to me, and I currently do not understand why it is happening.

To Reproduce Run the GAT example with num_heads=1:

layer = GATLayer(2, 2, num_heads=1)
layer.projection.weight.data = torch.Tensor([[1., 0.], [0., 1.]])
layer.projection.bias.data = torch.Tensor([0., 0.])
layer.a.data = torch.Tensor([[-0.2, 0.3], [0.1, -0.1]])

with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)

print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)

Expected behavior Different output features for the last two nodes.


phlippe commented 6 months ago

Hi, when you set the number of heads to 1, you need to adjust the weight, bias and a-vector accordingly. At the moment, the code above cannot run since there is a mismatch of shapes. A version that would run is the following:

layer = GATLayer(2, 2, num_heads=1)
layer.projection.weight.data = torch.Tensor([[1., 0.]])
layer.projection.bias.data = torch.Tensor([0.])
layer.a.data = torch.Tensor([[-0.2, 0.3]])

with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)

print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)

This gives the output:

Attention probs
 tensor([[[[0.3543, 0.6457, 0.0000, 0.0000],
          [0.1096, 0.1450, 0.2642, 0.4813],
          [0.0000, 0.1858, 0.2885, 0.5257],
          [0.0000, 0.2391, 0.2696, 0.4913]]]])
Adjacency matrix tensor([[[1., 1., 0., 0.],
         [1., 1., 1., 1.],
         [0., 1., 1., 1.],
         [0., 1., 1., 1.]]])
Input features tensor([[[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]]])
Output features tensor([[[1.2913],
         [4.2344],
         [4.6798],
         [4.5043]]])

The output features for all nodes are different.

E1k3 commented 6 months ago

Sorry, I failed to copy-paste my actual code.

I assumed the projection weight matrix shape would still be (2,2) and the shape of a would be (1,4):

layer = GATLayer(2, 2, num_heads=1)
layer.projection.weight.data = torch.eye(2)
layer.projection.bias.data = torch.zeros(2)
layer.a.data = torch.tensor([[-0.2, 0.3, 0.1, -0.1]])

with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)

print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)

Which results in

Attention probs
 tensor([[[[0.5000, 0.5000, 0.0000, 0.0000],
          [0.2500, 0.2500, 0.2500, 0.2500],
          [0.0000, 0.3333, 0.3333, 0.3333],
          [0.0000, 0.3333, 0.3333, 0.3333]]]])
Adjacency matrix tensor([[[1., 1., 0., 0.],
         [1., 1., 1., 1.],
         [0., 1., 1., 1.],
         [0., 1., 1., 1.]]])
Input features tensor([[[0., 1.],
         [2., 3.],
         [4., 5.],
         [6., 7.]]])
Output features tensor([[[1.0000, 2.0000],
         [3.0000, 4.0000],
         [4.0000, 5.0000],
         [4.0000, 5.0000]]])

I wasn't aware that an nn.Linear(2, 2) can even have a weight matrix with shape (1, 2).
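For reference, a small illustration of that point (not from the tutorial): assigning to .weight.data bypasses any shape check, so the layer simply behaves as if it had been constructed with the new shape.

import torch
import torch.nn as nn

# nn.Linear never re-validates .weight.data after construction, so an
# nn.Linear(2, 2) given a (1, 2) weight behaves like nn.Linear(2, 1).
lin = nn.Linear(2, 2)
lin.weight.data = torch.Tensor([[1., 0.]])
lin.bias.data = torch.Tensor([0.])
print(lin(torch.Tensor([[3., 4.]])))  # (1, 1) output: the layer now maps 2 -> 1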

One remaining problem for me is that the default initialization of the GATLayer parameters does not result in the shapes you set manually:

layer = GATLayer(2, 2, num_heads=1)

with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)

print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)

Output features tensor([[[ 3.4466, -1.1232],
         [ 4.8904, -1.5303],
         [ 7.3678, -2.2289],
         [ 7.3678, -2.2289]]])

So with the default initialization, you still run into my original problem.

Thanks for the quick answer!

phlippe commented 6 months ago

Hi, sorry, I indeed mixed up the shapes there; it should have been with c_out=1. But I now see the problem you run into. It only occurs because of the specific combination of w, b, a, and the node features in this case: the 0.1, -0.1 in a gives every neighbour the same impact, so the attention becomes a uniform distribution over all neighbours. If you change it to 0.1, -0.2, you will see all values change again. In classical training it is very unlikely to hit a combination of w, b, a, and node features like the one we initialized here, and it would not be a problem anyway, since a single gradient step takes you away from it.
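A quick way to see this (a minimal sketch, assuming the tutorial's logit form e_ij = LeakyReLU(a^T [W h_i || W h_j]) and the values from the example, i.e. W = I and b = 0): the neighbour-dependent half of every logit is 0.1 * x_j - 0.1 * y_j, and since each node has y = x + 1, this is -0.1 for every node, so the logits over any neighbourhood are constant and the softmax is uniform.

import torch

# Values from the example above: W = I, b = 0, a = [-0.2, 0.3, 0.1, -0.1].
node_feats = torch.Tensor([[0., 1.], [2., 3.], [4., 5.], [6., 7.]])
a_neighbour = torch.Tensor([0.1, -0.1])  # the half of a that multiplies W h_j

# The neighbour term is the same for every node, so every neighbourhood
# receives uniform attention weights.
print(node_feats @ a_neighbour)  # tensor([-0.1000, -0.1000, -0.1000, -0.1000])

Since nodes 3 and 4 also share the same neighbourhood, they then aggregate the same features with the same uniform weights, which gives the identical outputs.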

E1k3 commented 6 months ago

Very interesting that torch does not mind changing the shape of the weight matrix... I've tried random node_feats and GATLayer parameters, and fairly often the resulting features for the last two nodes are still the "same" to four decimal places, but not always:

for _ in range(10):
    node_feats = torch.rand((1, 4, 2))
    layer = GATLayer(2, 2, num_heads=1)

    with torch.no_grad():
        out_feats = layer(node_feats, adj_matrix, print_attn_probs=False)

    print(f"Output features\n{out_feats[:, -2:]}")

Output features
tensor([[[-0.4876,  0.1265],
         [-0.4876,  0.1265]]])
Output features
tensor([[[-0.8985, -0.4233],
         [-0.8380, -0.3486]]])
Output features
tensor([[[-0.4323, -0.1635],
         [-0.4323, -0.1635]]])
Output features
tensor([[[ 0.1025, -0.3395],
         [ 0.1025, -0.3395]]])
Output features
tensor([[[ 0.8478, -0.7233],
         [ 0.8478, -0.7233]]])
Output features
tensor([[[0.6933, 0.9880],
         [0.6939, 0.9911]]])
Output features
tensor([[[1.0354, 1.2086],
         [1.0354, 1.2086]]])
Output features
tensor([[[-0.6915,  0.1529],
         [-0.7278,  0.1536]]])
Output features
tensor([[[-0.5950, -0.1346],
         [-0.6858, -0.2273]]])
Output features
tensor([[[-0.6591, -0.7194],
         [-0.6591, -0.7194]]])

It is probably just very likely that nodes 3 and 4 primarily "attend" to the same neighbours, and the softmax makes the results similar enough to be indistinguishable at four decimal places.
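A rough numerical check of that guess (assuming the tutorial's logit form e_ij = LeakyReLU(a_src . W h_i + a_dst . W h_j); the values below are made up): nodes 3 and 4 share the neighbourhood {2, 3, 4}, so their logits differ only by the per-node constant a_src . W h_i, and whenever all logits land in the same LeakyReLU regime, that constant cancels in the softmax.

import torch
import torch.nn.functional as F

# Hypothetical source-node terms (a_src . W h_i) for nodes 3 and 4, and
# shared neighbour terms (a_dst . W h_j) for their common neighbourhood.
c_i = torch.tensor([0.4, 0.9])
d_j = torch.tensor([0.2, 0.5, 0.7])

# All logits are positive, so LeakyReLU acts as the identity on them and
# the per-node constant cancels in the softmax: both rows come out equal.
logits = F.leaky_relu(c_i[:, None] + d_j[None, :], negative_slope=0.2)
print(F.softmax(logits, dim=-1))

With identical attention rows over the same neighbours, the two aggregated outputs are identical as well; once some logits flip sign, the rows (and the outputs) start to differ.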

Thank you for looking into this!

E1k3 commented 6 months ago

no bugs here