microsoft / gated-graph-neural-network-samples

Sample Code for Gated Graph Neural Networks
MIT License

tying layers #9

Closed JacksonGL closed 6 years ago

JacksonGL commented 6 years ago

Thanks for sharing the GGNN code. I am a little confused by the concept of layers. In an earlier version of the codebase, each layer seemed to represent one timestep, and an option called tie_gnn_layers toggled whether weights, biases, and RNN cells were shared across all timesteps. After a later commit, layers no longer share those tensors and cells. Does that mean a layer now represents an actual physical layer in the model instead of a timestep? Thanks.

PS: Since tying layers is no longer an option, the "tie_gnn_layers": false parameter that the README still lists for GCN is obsolete. The comment about tying layers in the code also seems to be obsolete.

mmjb commented 6 years ago

Thanks for pointing these discrepancies out. I've updated the code in 29cda77 to remove the misleading comments.

To answer your actual question: The layer_timesteps hyperparameter allows fine-grained control over weight-sharing and the number of layers. For example, layer_timesteps: [8] specifies one layer run for 8 timesteps, whereas the current default layer_timesteps: [2, 2, 1, 2, 1] specifies 5 layers, in which the third and fifth are run for one timestep and the others for two timesteps. [The old code only allowed selecting the overall number of timesteps, and tie_gnn_layers toggled whether those all used the same weights or not.]
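As config overrides, those two settings would look roughly like this (the dict format here is just for illustration; the README describes the exact way to pass overrides):

```python
# Illustrative hyperparameter overrides (assumed format, not the exact
# shape of the repository's config dict):

# One GNN layer, unrolled for 8 timesteps with shared weights:
params_one_layer = {"layer_timesteps": [8]}

# Five layers; the third and fifth run for one timestep, the others for two,
# each layer with its own weights:
params_default = {"layer_timesteps": [2, 2, 1, 2, 1]}
```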

This new configuration option also combines well with residual connections. For example, in the default setting, the third layer (run for one timestep) takes as input the initial node embedding as well as the output of the second layer. Similarly, the fifth layer (run for one timestep) takes as input the initial node embedding, the output of the second layer, and the output of the fourth layer.
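In config form, that default wiring could be sketched roughly as follows (the residual_connections key and the indexing convention are shown only for illustration and may not match the repository's exact format):

```python
# Illustrative sketch of the default residual wiring (format is an assumption).
params_default = {
    "layer_timesteps": [2, 2, 1, 2, 1],
    # Keys are 0-based layer indices; values are "node state after N layers",
    # so 0 is the initial node embedding and 2 is the output of the second layer.
    "residual_connections": {
        2: [0],     # third layer additionally sees the initial node embeddings
        4: [0, 2],  # fifth layer sees the initial embeddings and the second layer's output
    },
}
```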

Hope that helps!

JacksonGL commented 6 years ago

Thanks for fixing the issue and answering my question.

I have one more question: why does the sparse implementation allow each layer to have a separate weight matrix and bias matrix? Since we are just unrolling the graph, I think all layers should share the same edge weights and biases. It seems that by default all layers share those parameters in the dense implementation of the GGNN, and there is no option to untie the layers. Thanks.

mmjb commented 6 years ago

Originally, I implemented this feature to match the GCN behaviour, where different weights per timestep were found to be helpful. Overall, I have not found just untying the weights super-helpful in the GNN setting (this may also be due to the fact that the GCN experiments were usually reported on a single, larger graph, whereas my experiments usually consider thousands of graphs).

However, this untying is required for the residual connections, as I implemented them by concatenating the residual "message" to the incoming message at that timestep, which changes the dimensionality of the input of the gated cell used for the state update function. With these, I found substantial improvements on the chemistry dataset here in the repo, as well as on other internal datasets. They also help with training GNNs with many timesteps, which tend to suffer from their deep computation graphs in the same ways as RNNs on long sequences.
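A minimal sketch of that concatenation idea (TensorFlow, with assumed tensor shapes and names; not a copy of the actual implementation):

```python
import tensorflow as tf

def gated_cell_input(incoming_messages, residual_states):
    """Build the input to the gated state-update cell for one layer.

    incoming_messages: [num_nodes, hidden_dim] aggregated messages at this timestep
    residual_states:   list of [num_nodes, hidden_dim] residual "messages"
                       (e.g. initial node embeddings, earlier layer outputs)

    Concatenation widens the cell input whenever residual_states is non-empty,
    so layers with different residual wiring need their own (untied) weights.
    """
    return tf.concat([incoming_messages] + residual_states, axis=-1)
```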

JacksonGL commented 6 years ago

I see. Thanks for sharing the interesting findings.