snap-stanford / deepsnap

Python library assists deep learning on graphs
https://snap.stanford.edu/deepsnap/
MIT License

Architecture of the aggregation in HeteroSAGEConv? #29

Open anniekmyatt opened 3 years ago

anniekmyatt commented 3 years ago

Hello! Not really an issue but I have a question about the implementation of the update step in hetero_gnn.py. What is the benefit of calculating the output via these lines:

```python
aggr_out = self.lin_neigh(aggr_out)
node_feature_self = self.lin_self(node_feature_self)
aggr_out = torch.cat([aggr_out, node_feature_self], dim=-1)
aggr_out = self.lin_update(aggr_out)
```
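For context, the quoted update step can be sketched as a standalone module (a simplified sketch for illustration; the class name, constructor arguments, and example dimensions are my assumptions, not deepsnap's actual signature):

```python
import torch
import torch.nn as nn

class SAGEUpdateSketch(nn.Module):
    """Simplified sketch of the concat-then-update step quoted above."""
    def __init__(self, in_channels_neigh, in_channels_self, out_channels):
        super().__init__()
        # One linear layer per branch, then one on the concatenation.
        self.lin_neigh = nn.Linear(in_channels_neigh, out_channels)
        self.lin_self = nn.Linear(in_channels_self, out_channels)
        self.lin_update = nn.Linear(2 * out_channels, out_channels)

    def forward(self, aggr_out, node_feature_self):
        aggr_out = self.lin_neigh(aggr_out)
        node_feature_self = self.lin_self(node_feature_self)
        aggr_out = torch.cat([aggr_out, node_feature_self], dim=-1)
        return self.lin_update(aggr_out)

# Example: 5 nodes, neighbour dim 16, self dim 8, output dim 32.
update = SAGEUpdateSketch(16, 8, 32)
out = update(torch.randn(5, 16), torch.randn(5, 8))
print(out.shape)  # torch.Size([5, 32])
```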

so applying a linear layer to the aggregated neighbour features, another linear layer to the node's own features, and then a third layer to the concatenation of the results? In terms of the weight matrix multiplications this represents something like

$$h_v^{\text{new}} = W_y \,\mathrm{CONCAT}\!\left(W_n\, a_v + b_n,\; W_s\, h_v + b_s\right) + b_y,$$

where $a_v$ denotes the aggregated neighbour feature and $h_v$ the node's own feature.

I thought it would be simpler to use just

```python
aggr_out = torch.cat([aggr_out, node_feature_self], dim=-1)
aggr_out = self.lin_update(aggr_out)
```

where `self.lin_update` is now initialised as `self.lin_update = nn.Linear(self.in_channels_self + self.in_channels_neigh, self.out_channels)`, and we no longer need the linear layers `self.lin_neigh` and `self.lin_self`?
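The proposed simpler variant would look something like this (again a sketch with assumed names and example dimensions):

```python
import torch
import torch.nn as nn

class SimpleUpdateSketch(nn.Module):
    """Sketch of the proposed simpler update: concat inputs, one linear layer."""
    def __init__(self, in_channels_neigh, in_channels_self, out_channels):
        super().__init__()
        self.lin_update = nn.Linear(in_channels_neigh + in_channels_self,
                                    out_channels)

    def forward(self, aggr_out, node_feature_self):
        aggr_out = torch.cat([aggr_out, node_feature_self], dim=-1)
        return self.lin_update(aggr_out)

update = SimpleUpdateSketch(16, 8, 32)
out = update(torch.randn(5, 16), torch.randn(5, 8))
print(out.shape)  # torch.Size([5, 32])
print(sum(p.numel() for p in update.parameters()))  # (16 + 8) * 32 + 32 = 800
```

With these example dimensions this variant has 800 parameters, while the three-layer version has 544 + 288 + 2,080 = 2,912, so the gap depends on the channel sizes but is indeed not enormous.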

This represents something like

$$h_v^{\text{new}} = W_y' \,\mathrm{CONCAT}(a_v, h_v) + b_y',$$

where CONCAT is the vector concatenation operator, $a_v$ and $h_v$ are the aggregated neighbour feature and the node's own feature, and the prime indicates that $W_y$ and $b_y$ now have a different dimension.

In terms of the number of parameters in the model it doesn't make a huge difference, but by including these additional layers you get a more complex optimisation surface that involves a product of weight matrices. Wouldn't this make it a bit harder for gradient descent to reach a good solution?
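One way to see the "product of weight matrices" point: since there is no nonlinearity between the three linear layers in the quoted update step, their composition is itself a single linear map of the inputs, so the extra layers reparameterise the same function class rather than enlarging it (a nonlinearity applied afterwards, outside the update step, would not change this within the step). A small check of the two-layer case (my own illustration, not deepsnap code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin_a = nn.Linear(8, 16)
lin_b = nn.Linear(16, 4)

x = torch.randn(10, 8)
composed = lin_b(lin_a(x))

# Collapse the two layers into one equivalent linear map:
# y = B(Ax + a) + b = (BA)x + (Ba + b)
W = lin_b.weight @ lin_a.weight
bias = lin_b.weight @ lin_a.bias + lin_b.bias
collapsed = x @ W.t() + bias

print(torch.allclose(composed, collapsed, atol=1e-5))  # True
```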

Thank you for any explanation you can provide for the benefits of the slightly more complex architecture implemented in deepsnap!

zechengz commented 3 years ago

Hi,

The idea of using separate linear layers for the self and neighbour features is mainly derived from the Relational GCN, which is briefly described on p. 13 of these slides. Adding another layer at the end plays the role of a post-processing layer, which is introduced on p. 52 of these slides. Using a post-processing layer is usually helpful, as shown in this paper.
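The post-processing idea mentioned above can be sketched as an MLP head applied to node embeddings after the message-passing layers (my own illustration of the pattern; the layer sizes and class count are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical head that post-processes node embeddings produced by the
# GNN layers, e.g. before a node-classification loss.
post_process = nn.Sequential(
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 7),  # e.g. 7 output classes
)

node_embeddings = torch.randn(100, 32)  # stand-in for GNN output
logits = post_process(node_embeddings)
print(logits.shape)  # torch.Size([100, 7])
```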