Hi, when you set the number of heads to 1, you need to adjust the weight, bias, and a-vector accordingly. At the moment, the code above cannot run because the shapes do not match. A version that would run is the following:
layer = GATLayer(2, 2, num_heads=1)
layer.projection.weight.data = torch.Tensor([[1., 0.]])
layer.projection.bias.data = torch.Tensor([0.])
layer.a.data = torch.Tensor([[-0.2, 0.3]])
with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)
print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)
This gives the output:
Attention probs
tensor([[[[0.3543, 0.6457, 0.0000, 0.0000],
[0.1096, 0.1450, 0.2642, 0.4813],
[0.0000, 0.1858, 0.2885, 0.5257],
[0.0000, 0.2391, 0.2696, 0.4913]]]])
Adjacency matrix tensor([[[1., 1., 0., 0.],
[1., 1., 1., 1.],
[0., 1., 1., 1.],
[0., 1., 1., 1.]]])
Input features tensor([[[0., 1.],
[2., 3.],
[4., 5.],
[6., 7.]]])
Output features tensor([[[1.2913],
[4.2344],
[4.6798],
[4.5043]]])
The output features for all nodes are different.
Sorry, I failed to copy-paste my actual code.
I assumed the projection weight matrix shape would still be (2, 2) and the shape of a would be (1, 4):
layer = GATLayer(2, 2, num_heads=1)
layer.projection.weight.data = torch.eye(2)
layer.projection.bias.data = torch.zeros(2)
layer.a.data = torch.tensor([[-0.2, 0.3, 0.1, -0.1]])
with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)
print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)
Which results in
Attention probs
tensor([[[[0.5000, 0.5000, 0.0000, 0.0000],
[0.2500, 0.2500, 0.2500, 0.2500],
[0.0000, 0.3333, 0.3333, 0.3333],
[0.0000, 0.3333, 0.3333, 0.3333]]]])
Adjacency matrix tensor([[[1., 1., 0., 0.],
[1., 1., 1., 1.],
[0., 1., 1., 1.],
[0., 1., 1., 1.]]])
Input features tensor([[[0., 1.],
[2., 3.],
[4., 5.],
[6., 7.]]])
Output features tensor([[[1.0000, 2.0000],
[3.0000, 4.0000],
[4.0000, 5.0000],
[4.0000, 5.0000]]])
I wasn't aware that an nn.Linear(2, 2) can even have a weight matrix with shape (1, 2).
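For reference, here is a minimal standalone sketch (not from the tutorial) showing that PyTorch accepts a reassigned .data with a different shape without complaint; the layer then simply computes with whatever shape the weight currently has:

import torch
import torch.nn as nn

lin = nn.Linear(2, 2)                       # weight shape (2, 2), bias shape (2,)
lin.weight.data = torch.Tensor([[1., 0.]])  # replace with a (1, 2) weight
lin.bias.data = torch.Tensor([0.])          # and a matching (1,) bias

x = torch.Tensor([[2., 3.]])
print(lin(x))            # tensor([[2.]]) -- now effectively behaves like Linear(2, 1)
print(lin.weight.shape)  # torch.Size([1, 2])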
One remaining problem for me is that the default initialization of the GATLayer parameters does not produce your manually set shapes:
layer = GATLayer(2, 2, num_heads=1)
with torch.no_grad():
    out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)
print("Adjacency matrix", adj_matrix)
print("Input features", node_feats)
print("Output features", out_feats)
→
Output features tensor([[[ 3.4466, -1.1232],
[ 4.8904, -1.5303],
[ 7.3678, -2.2289],
[ 7.3678, -2.2289]]])
So you keep running into my original problem.
Thanks for the quick answer!
Hi, sorry, I indeed mixed up the shapes there; it should have been with c_out=1. But I see now the problem you run into. This only happens because of the specific combination of the values of w, b, a and the node features in this case: the 0.1, -0.1 in a gives every neighbor the same contribution, so the attention becomes a uniform distribution over all neighbors. If you change it to 0.1, -0.2, you will see that all values change again. In classical training, it is very unlikely to hit a combination of w, b, a and node features like the one we initialized here, and it wouldn't be a problem anyway, since a single gradient step takes you away from it.
Very interesting that torch does not mind changing the shape of the weight matrix...
I've tried random node_feats and GATLayer parameters, and fairly often the resulting features for the last two nodes are still the "same" for the first four decimal places, but not always:
for x in range(10):
    node_feats = torch.rand((1, 4, 2))
    layer = GATLayer(2, 2, num_heads=1)
    with torch.no_grad():
        out_feats = layer(node_feats, adj_matrix, print_attn_probs=True)
    print(f"Output features\n{out_feats[:, -2:]}")
Output features
tensor([[[-0.4876, 0.1265],
[-0.4876, 0.1265]]])
Output features
tensor([[[-0.8985, -0.4233],
[-0.8380, -0.3486]]])
Output features
tensor([[[-0.4323, -0.1635],
[-0.4323, -0.1635]]])
Output features
tensor([[[ 0.1025, -0.3395],
[ 0.1025, -0.3395]]])
Output features
tensor([[[ 0.8478, -0.7233],
[ 0.8478, -0.7233]]])
Output features
tensor([[[0.6933, 0.9880],
[0.6939, 0.9911]]])
Output features
tensor([[[1.0354, 1.2086],
[1.0354, 1.2086]]])
Output features
tensor([[[-0.6915, 0.1529],
[-0.7278, 0.1536]]])
Output features
tensor([[[-0.5950, -0.1346],
[-0.6858, -0.2273]]])
Output features
tensor([[[-0.6591, -0.7194],
[-0.6591, -0.7194]]])
It's probably just that nodes 3 and 4 very often primarily "attend" to the same neighbours, and the softmax makes the results similar enough to be indistinguishable at four decimal places.
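To quantify this instead of eyeballing decimals, a quick sketch (assuming the same GATLayer, adj_matrix and random node_feats as above) could count how often the last two rows are numerically identical:

import torch

matches = 0
for _ in range(1000):
    node_feats = torch.rand((1, 4, 2))
    layer = GATLayer(2, 2, num_heads=1)
    with torch.no_grad():
        out_feats = layer(node_feats, adj_matrix)
    # nodes 3 and 4 share the same neighbourhood, so compare their outputs
    matches += torch.allclose(out_feats[0, 2], out_feats[0, 3])
print(f"{matches} / 1000 runs gave identical outputs for the last two nodes")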
Thank you for looking into this!
no bugs here
Tutorial: 7 (pytorch)
Describe the bug The GAT layer is introduced as superior to plain graph convolutions because it does not, by default, treat nodes with identical neighbours identically; but when I set num_heads=1 in the provided short example, that is not the case.
To me, this does not make sense, but I also currently do not understand why it is happening.
To Reproduce (if any steps necessary) Run the GAT example with num_heads=1.
Expected behavior Different output features for the last two nodes.