pyg-team / pytorch_geometric

Graph Neural Network Library for PyTorch
https://pyg.org
MIT License

Dataset construction for shared adjacency matrix and varying node features #965

Closed chnsh closed 4 years ago

chnsh commented 4 years ago

❓ Questions & Help

First of all, thank you for the package - it is very well designed! I am trying to use it for my problem and aim to use PyTorch Geometric to implement new architectures.

The examples in the documentation all talk about creating Data with edge_index, x, etc. In my case, the underlying adjacency matrix is the same across the dataset and I only have varying node features (and labels).

I want to use a PyG model and set up the dataset so that only the node features vary while the underlying graph topology is shared (crucially, without copying the graph len(dataset) times).

I was thinking of doing this by passing the adjacency matrix to the constructor of a Dataset class and using it in the __getitem__ method. Are there any caveats to this approach, or does it violate any best practices?

rusty1s commented 4 years ago

You can use PyG operators to work in a batch-wise fashion on the same graph topology by using features of shape [batch_size, num_nodes, num_features] and setting the node_dim attribute of the operators to 1. For dataset construction, I would save the edge_index as a property of the dataset, and only access node features via __getitem__.

chnsh commented 4 years ago

Thanks for responding! What do you mean by saving edge_index as a property of the dataset? I had tried that with GCN: I passed the same edge index every time and only changed the features, and I got completely incorrect results.

From that experiment I got the idea that possibly PyG requires each data point to have an edge_index definition (possibly for batching purposes?)

Also, do you happen to have an example of using node_dim? I am not entirely sure what you mean by it

rusty1s commented 4 years ago

So, what I mean is:

from torch_geometric.nn import GCNConv

conv = GCNConv(in_channels, out_channels, node_dim=1)
out = conv(x, edge_index)

where x is a [batch_size, num_nodes, num_features] tensor and edge_index holds indices < num_nodes. For datasets, you can then use something like this:

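class MyDataset(torch.utils.data.Dataset):
    def __init__(...):
        self.edge_index = ...  # shared graph connectivity, stored once
        self.x_all = ...  # node features of all examples
    def __len__(self):
        return len(self.x_all)
    def __getitem__(self, idx):
        return self.x_all[idx]  # only the features vary per example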

and use the regular PyTorch DataLoader to create batches for x.
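
A minimal, runnable sketch of this idea in plain PyTorch might look as follows. The class name SharedGraphDataset, the label tensor y_all, and the toy sizes are assumptions made for illustration; only the pattern (edge_index stored once on the dataset, features returned per item) comes from the comment above.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class SharedGraphDataset(Dataset):
    """Hypothetical dataset: one shared edge_index, many feature tensors."""

    def __init__(self, edge_index, x_all, y_all):
        self.edge_index = edge_index  # [2, num_edges], stored exactly once
        self.x_all = x_all            # [num_examples, num_nodes, num_features]
        self.y_all = y_all            # [num_examples, num_nodes, num_targets]

    def __len__(self):
        return self.x_all.size(0)

    def __getitem__(self, idx):
        # Only features and labels are returned; the graph is never copied.
        return self.x_all[idx], self.y_all[idx]


# Toy data: 10 examples on a 3-node path graph with 5 features per node.
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
dataset = SharedGraphDataset(edge_index, torch.randn(10, 3, 5), torch.randn(10, 3, 1))
loader = DataLoader(dataset, batch_size=4)

x, y = next(iter(loader))
print(x.shape)                   # batched features: [batch_size, num_nodes, num_features]
print(dataset.edge_index.shape)  # still the single shared [2, num_edges] tensor
```

In a training loop, the batched x would be passed to the model together with dataset.edge_index, which is read from the dataset rather than from the batch.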

chnsh commented 4 years ago

Got it! Thanks.

Makes sense on the data loader, will change my implementation and update.

Can you point me to an explanation of what node_dim is doing? I mean my situation has a shared adjacency matrix, not sure why I have to change the axis of propagation?

rusty1s commented 4 years ago

You tell the convolution operator which dimension it should view as the node dimension. So what MessagePassing is doing under the hood is the following:

source_nodes, target_nodes = edge_index
x_j = x.index_select(node_dim, source_nodes) # Will result in a [batch_size, num_edges, num_features] tensor in your case
out = scatter_add(x_j, target_nodes, dim=node_dim, dim_size=num_nodes) # Will result in a [batch_size, num_nodes, num_features] tensor in your case
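
As a rough illustration of the gather and scatter steps above, the following plain-PyTorch sketch uses index_select and index_add_ in place of PyG's internals. The helper name propagate_sum and the toy sizes are made up for the example; it is a simplified stand-in for sum aggregation, not the library's actual implementation.

```python
import torch


def propagate_sum(x, edge_index, node_dim=1):
    """Sum-aggregate neighbor features along `node_dim` (illustrative helper)."""
    source_nodes, target_nodes = edge_index
    # Gather the features of every edge's source node.
    x_j = x.index_select(node_dim, source_nodes)
    # Scatter-add each message into the slot of its target node.
    out = torch.zeros_like(x)
    out.index_add_(node_dim, target_nodes, x_j)
    return x_j, out


batch_size, num_nodes, num_features = 4, 5, 3
edge_index = torch.tensor([[0, 1, 1, 2, 3, 4],
                           [1, 0, 2, 1, 4, 3]])  # 6 directed edges
x = torch.randn(batch_size, num_nodes, num_features)

x_j, out = propagate_sum(x, edge_index)
print(x_j.shape)  # [batch_size, num_edges, num_features]
print(out.shape)  # [batch_size, num_nodes, num_features]
```

Because the indices in edge_index only address the node dimension, the same [2, num_edges] tensor is reused for every example in the batch.
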

chnsh commented 4 years ago

Apologies for the delay. This works quite well, thanks! It took me a long time to track down one bug: when I used torch_geometric.utils.from_networkx(G), it distorted my graph, because my graph was already integer ordered and the method reorders it.

Is that something you want me to file an issue for?

rusty1s commented 4 years ago

This might be a bug, do you have a minimal example to reproduce it?

chnsh commented 4 years ago

If you have a graph like G = nx.fast_gnp_random_graph(100, 0.1) and call torch_geometric.utils.from_networkx(G), the graph gets distorted. My graph is already integer ordered, but networkx stores nodes in a dictionary, which is not ordered, so the method ends up reordering them.
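
Whether from_networkx actually reorders nodes is not confirmed in this thread. One way to rule out ordering problems is to build edge_index by hand with an explicit node-to-row mapping, as in the sketch below. The helper build_edge_index and the toy node and edge lists are hypothetical; the sketch assumes an undirected graph whose node labels are sortable.

```python
import torch


def build_edge_index(nodes, edges):
    """Map node labels to row indices in sorted order and emit both directions."""
    index_of = {node: i for i, node in enumerate(sorted(nodes))}
    pairs = []
    for u, v in edges:
        pairs.append((index_of[u], index_of[v]))
        pairs.append((index_of[v], index_of[u]))  # undirected: add the reverse edge
    return torch.tensor(pairs, dtype=torch.long).t().contiguous()


# Nodes listed out of order on purpose; the mapping does not depend on it.
nodes = [3, 0, 2, 1]
edges = [(3, 0), (2, 1)]
edge_index = build_edge_index(nodes, edges)
print(edge_index)
```

With this mapping, row i of the feature matrix always belongs to the i-th smallest node label, regardless of the order in which the nodes were inserted into the graph.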

Maria1810 commented 10 months ago

Hi,

this conversation is already a bit old, and when I look at PyTorch Geometric's documentation of the GCNConv layer I don't see the node_dim argument. However, I currently have the same issue as described here and I really hope someone can help me. I have a dataset of many graphs that all have 1521 nodes and 13 features per node. They all share the same adjacency matrix, which has shape 1521 x 1521; after edge_index = adjacency_matrix.nonzero().t().contiguous(), the edge index has shape [2, 55543]. I organized my data in batches of batch_size=20, so batch.x has the shape [20, 1521, 13] and batch.edge_index has the shape [20, 2, 55543]. When I simply try to do

class GNNModel(nn.Module):
    def __init__(self, num_features, num_feats_y):
        super(GNNModel, self).__init__()

        self.conv1 = GCNConv(num_features, 8, node_dim=1)

    def forward(self, x, edge_index):

        x = self.conv1(x, edge_index)

        return x

I get the error:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 20 but got size 2 for tensor number 1 in the list.

When I instead try with batch.edge_index of shape [2, 55543], I get the error:

RuntimeError: mat1 and mat2 shapes cannot be multiplied (30420x13 and 1521x8)

So how exactly am I supposed to adjust my code here? Please help.
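
One plausible source of the [20, 2, 55543] shape, stated here as an assumption about the setup rather than a diagnosis, is that every dataset item carries its own copy of edge_index and the default DataLoader collation stacks those copies. The toy sketch below (a 3-node graph instead of 1521 nodes) reproduces that effect and recovers the shared two-dimensional edge_index.

```python
import torch
from torch.utils.data import DataLoader

# Dense adjacency of a 3-node path graph, converted as described in the question.
adjacency_matrix = torch.tensor([[0, 1, 0],
                                 [1, 0, 1],
                                 [0, 1, 0]])
edge_index = adjacency_matrix.nonzero().t().contiguous()  # [2, num_edges]

# Each item holds (features, edge_index); default collation stacks both fields.
items = [(torch.randn(3, 13), edge_index) for _ in range(20)]
x, batched_edge_index = next(iter(DataLoader(items, batch_size=20)))
print(x.shape)                   # [20, 3, 13]
print(batched_edge_index.shape)  # [20, 2, 4], one copy per example

# All copies are identical, so any single one is the shared graph.
shared_edge_index = batched_edge_index[0]
print(shared_edge_index.shape)   # [2, 4]
```

Keeping edge_index out of the items altogether, and reading it once from the dataset, avoids the stacked copies in the first place.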

rusty1s commented 10 months ago

Please note that edge_index always needs to be two-dimensional, so the mini-batching of [batch_size, num_nodes, num_features] node features is designed to operate on the same underlying graph of shape [2, num_edges]. If your graph is not static across examples, you will have to use PyG's approach of diagonal adjacency matrix stacking.

Maria1810 commented 10 months ago

Thank you so much for your answer. I have now concatenated my edge index 20 times, so instead of [2, 55543] it has the shape [2, 1110860]. However, the node_dim=1 in self.conv1 = GCNConv(num_features, 8, node_dim=1) led to a dimension mismatch; after removing node_dim=1, it runs without errors.

My model looks like this:

class GNNModel(nn.Module):
    def __init__(self, num_features, num_feats_y):
        super(GNNModel, self).__init__()

        self.conv1 = GCNConv(num_features, 8)
        self.bn1 = BatchNorm(8)
        self.conv2 = GCNConv(8, 16)
        self.bn2 = BatchNorm(16)
        self.conv3 = GCNConv(16, 32)
        self.bn3 = BatchNorm(32)
        self.conv4 = GCNConv(32, 64)
        self.bn4 = BatchNorm(64)
        self.conv5 = GCNConv(64, num_feats_y)
        self.dropout = torch.nn.Dropout(p=0.1)

        self.fc = nn.Sequential(nn.Linear(num_feats_y, 32), nn.GELU(),
                                nn.Linear(32, 64), nn.GELU(),
                                nn.Linear(64, num_feats_y))

    def forward(self, x, edge_index):
        x = self.conv1(x, edge_index)
        x = self.bn1(x)
        x = F.gelu(x)
        x = self.dropout(x)
        x = self.conv2(x, edge_index)
        x = self.bn2(x)
        x = F.gelu(x)
        x = self.dropout(x)
        x = self.conv3(x, edge_index)
        x = self.bn3(x)
        x = F.gelu(x)
        x = self.dropout(x)
        x = self.conv4(x, edge_index)
        x = self.bn4(x)
        x = F.gelu(x)
        x = self.dropout(x)
        x = self.conv5(x, edge_index)
        x = F.gelu(x)
        x = self.fc(x)

        return x

I have a time series of 13 features (for example, temperatures or pressures at 13 height levels) at 1521 different locations (lat-lon combinations). I model the lat-lon grid as a graph, so each location is one node, and each node has 13 features (the 13 temperatures at different heights). In the adjacency matrix, I connect all pairs of nodes that are closer than a certain geographical distance. The grid never changes, so the adjacency matrix always stays the same.

My goal is to forecast the graph at time t+1 from the graph at time t, so the graph at time t is the input of my GNN and the graph at time t+1 is the output. Does the architecture make sense for that? The results seem slightly better when I leave out the x = F.gelu(x) and x = self.fc(x) at the end, but in general the performance is very poor. Do you have an idea? Am I conceptually doing something completely wrong here?
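
The forecasting setup described here (input at time t, target at time t+1, one fixed grid) could be organized as in the sketch below. The class name NextStepDataset and the toy sizes are assumptions for illustration; the real data would have 1521 nodes and 13 features per node.

```python
import torch
from torch.utils.data import Dataset


class NextStepDataset(Dataset):
    """Hypothetical dataset pairing the node features at step t with those at t + 1."""

    def __init__(self, series, edge_index):
        self.series = series          # [num_steps, num_nodes, num_features]
        self.edge_index = edge_index  # shared [2, num_edges] graph, stored once

    def __len__(self):
        # The last time step has no successor, so it can only serve as a target.
        return self.series.size(0) - 1

    def __getitem__(self, t):
        return self.series[t], self.series[t + 1]


# Toy series: 5 time steps, 4 nodes, 2 features per node.
series = torch.arange(5 * 4 * 2, dtype=torch.float).view(5, 4, 2)
edge_index = torch.tensor([[0, 1],
                           [1, 0]])
dataset = NextStepDataset(series, edge_index)

x_t, x_next = dataset[0]
print(len(dataset))  # one fewer than the number of time steps
print(x_t.shape, x_next.shape)
```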

rusty1s commented 10 months ago

If your adjacency matrix is static across time, then you can use the scheme of [batch_size, num_nodes, num_features] while edge_index only holds the connection of a single graph. If you concatenate edge_index multiple times, you need to make sure to also increase its indices properly as otherwise you only duplicate the edges, and basically convert your graph to a multi-graph.
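
To make the point about increasing the indices concrete, here is a small sketch in plain PyTorch. The helper stack_edge_index and the toy graph are hypothetical; the sketch assumes the node features are flattened to [batch_size * num_nodes, num_features], so that node i of example b sits in row b * num_nodes + i.

```python
import torch


def stack_edge_index(edge_index, num_nodes, batch_size):
    """Repeat edge_index once per example, shifting each copy by b * num_nodes."""
    offsets = torch.arange(batch_size) * num_nodes               # [batch_size]
    stacked = edge_index.unsqueeze(0) + offsets.view(-1, 1, 1)   # [batch_size, 2, num_edges]
    return stacked.permute(1, 0, 2).reshape(2, -1)               # [2, batch_size * num_edges]


num_nodes, batch_size = 4, 3
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 3]])

stacked = stack_edge_index(edge_index, num_nodes, batch_size)
naive = edge_index.repeat(1, batch_size)  # plain concatenation without offsets

print(stacked)
print(int(naive.max()), int(stacked.max()))
```

The naive concatenation never references a node index of num_nodes or higher, so it only duplicates edges of the first graph, whereas the shifted version addresses every example's own block of nodes.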

Maria1810 commented 10 months ago

I couldn't figure out how to do it with [batch_size, num_nodes, num_features]. I said in my first question, that batch.x had size [20, 1521, 13] and I tried with batch.edge_index of size [20, 2, 55543] and also batch.edge_index of size [2, 55543]. But I couldn't get either option to work.

rusty1s commented 10 months ago

Option 2 should definitely work. Do you have an example to reproduce?

Maria1810 commented 10 months ago

Actually, it does work now with edge_index of size [2, 55543] for the whole batch. I think the node_dim=1 was the problem. However, my results are not very good, and I don't know why. My model looks like this:

class GNNModel(nn.Module):
    def __init__(self, num_features, hidden_channels, num_feats_y):
        super(GNNModel, self).__init__()

        # Convolutional Message Passing Layers

        self.conv1 = GCNConv(num_features, hidden_channels[0])
        self.bn1 = BatchNorm(hidden_channels[0])
        self.conv2 = GCNConv(hidden_channels[0], hidden_channels[1])
        self.bn2 = BatchNorm(hidden_channels[1])
        self.conv3 = GCNConv(hidden_channels[1], hidden_channels[2])
        self.bn3 = BatchNorm(hidden_channels[2])
        self.conv4 = GCNConv(hidden_channels[2], hidden_channels[3])
        self.bn4 = BatchNorm(hidden_channels[3])
        self.conv5 = GCNConv(hidden_channels[3], num_feats_y)
        self.dropout = torch.nn.Dropout(p=0.1)

        # Dense layer for regression
        self.fc = nn.Sequential(nn.Linear(num_feats_y, 32), nn.GELU(),
                                nn.Linear(32, 64), nn.GELU(),
                                nn.Linear(64, num_feats_y))

    def forward(self, x, edge_index):
        # Message Passing Layers (GCNConv)

        x = self.conv1(x, edge_index)
        x = self.bn1(x)
        x = F.gelu(x)
        x = self.dropout(x)

        x = self.conv2(x, edge_index)
        x = self.bn2(x)
        x = F.gelu(x)
        x = self.dropout(x)

        x = self.conv3(x, edge_index)
        x = self.bn3(x)
        x = F.gelu(x)
        x = self.dropout(x)

        x = self.conv4(x, edge_index)
        x = self.bn4(x)
        x = F.gelu(x)
        x = self.dropout(x)

        x = self.conv5(x, edge_index)
        x = F.gelu(x)
        x = self.fc(x)

        return x

And my input data x has shape [batch_size x num_nodes, num_features] = [20 x 1521, 13] = [30420, 13] and my edge_index has shape [2,55543]. I do my training like this:

model = GNNModel(13, [8, 32, 64, 128], 13)
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=0.0001, weight_decay=0.0005)
mse = torch.nn.MSELoss()

train_mse = []
for epoch in range(num_epochs):

    model.train()
    dataset_iter = iter(train_loader)

    train_mse_tmp = []
    for bidx, (input, edge_idx, target) in enumerate(dataset_iter):

        optimizer.zero_grad()
        preds = ddp_model(input, edge_idx)
        loss = mse(preds, target)
        loss.backward()
        optimizer.step()

        train_mse_tmp.append(loss.item())
    train_mse.append(np.mean(train_mse_tmp))

rusty1s commented 10 months ago

Ok, glad that it is running through now. Regarding model performance, this is hard to tell. Your model is very deep (which may be an issue for GCN), and your LR seems quite low.

songsong0425 commented 8 months ago

Hi, sorry for reviving the discussion, but I would appreciate any ideas for a similar task. In my case, I have 1) a backbone network with initial node features that is used only for the message passing and 2) multiple sets of node features with binary edge labels.

[image] (Please ignore the absence of node features in the capture.)

import torch
from torch_geometric.data import Data

edge_index = torch.tensor([[0, 1, 1, 2, 3, 4, 4, 5, 6, 7, 8, 8, 9, 9, 9],
                           [1, 0, 2, 1, 4, 3, 5, 4, 7, 6, 8, 9, 8, 9, 7]], dtype=torch.long)

x1, x2, x3 = torch.randn(10, 5), torch.randn(10, 5), torch.randn(10, 5)
data1 = Data(edge_index=edge_index, x=x1, edge_label=torch.tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))
data2 = Data(edge_index=edge_index, x=x2, edge_label=torch.tensor([1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1]))
data3 = Data(edge_index=edge_index, x=x3, edge_label=torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0]))

I want the deep-learning model to learn the underlying pattern of the datasets and do link prediction (i.e., edge classification) for new datasets (only node features and edge labels). In my initial trial, I used Batch.from_data_list() to get the distributed dataset per batch, but I thought that it couldn't learn the characteristics of each dataset since it splits the dataset into parts. I'm also not sure whether I should use a temporal GCN or a dynamic graph, since these are not sequential datasets.

Sorry for the messy question, but if you have any example code or an opinion, please feel free to share it. Thank you for reading this question!

rusty1s commented 8 months ago

What do you mean by temporal GNNs? I don't see any temporal information in your data. Using DataLoader/Batch.from_data_list() would just be a way to compute your prediction in batches, but data is not shared among examples in the same batch, so I think you will be fine.
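
The claim that data is not shared among examples in a batch can be checked with a small plain-PyTorch sketch that mimics diagonal stacking by hand. The helper sum_neighbors and the toy graph are hypothetical, and the sketch does not call Batch.from_data_list() itself; it only assumes that batching shifts each example's node indices into its own block.

```python
import torch


def sum_neighbors(x, edge_index):
    """Sum the features of each node's incoming neighbors (illustrative helper)."""
    src, dst = edge_index
    out = torch.zeros_like(x)
    out.index_add_(0, dst, x[src])
    return out


num_nodes = 3
edge_index = torch.tensor([[0, 1, 1, 2],
                           [1, 0, 2, 1]])
# Two examples in one batch: the second copy's indices are shifted by num_nodes.
batched_edge_index = torch.cat([edge_index, edge_index + num_nodes], dim=1)

x = torch.randn(2 * num_nodes, 4)
out = sum_neighbors(x, batched_edge_index)

# Perturb only the second example's features and aggregate again.
x_changed = x.clone()
x_changed[num_nodes:] += 1.0
out_changed = sum_neighbors(x_changed, batched_edge_index)

first_unaffected = torch.equal(out[:num_nodes], out_changed[:num_nodes])
second_affected = not torch.equal(out[num_nodes:], out_changed[num_nodes:])
print(first_unaffected, second_affected)
```

Since no edge connects the two blocks, changing one example's features leaves the aggregated output of the other example untouched.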

songsong0425 commented 8 months ago

Thank you for your kind comment! At that time, I thought that the model could only learn the datasets one by one (train-val-test for dataset1, train-val-test for dataset2, ...), and that's why I considered the temporal models. But if the data in Batch.from_data_list() won't be shared, I'll try it again.