Closed: AlexDuvalinho closed this issue 6 months ago
Hi Alex, thank you for your interest in the work.
embedding layer
smooth cutoff
normalization
intermediate node embedding
equivariant message
This is a good point; we could have made this clearer in the paper. In writing s_ij^2 * r_ij we implicitly match up their dimensions. These are the implicit steps before multiplying (shapes illustrative):
- s_ij^2 has shape (F,) and gains a spatial dimension, giving shape (1, F)
- r_ij has shape (3,) and gains a feature dimension, giving shape (3, 1)
- the element-wise product then broadcasts both tensors to the common shape (3, F)
The added dimensions are expanded by simply repeating the tensor along the new dimension (i.e. a broadcasting operation as used in NumPy and PyTorch); see this part of the code for further details. Note that the shapes I listed above are meant for illustrative purposes; the tensors in the code are shaped differently, as we follow PyTorch Geometric's way of representing graphs. Regarding the best way of combining 3D vectors with 1D features, I don't think our current architecture nails this part; there is definitely room for improvement. This paper provides more details about equivariance and how to incorporate directional information using spherical harmonics, which is computationally more expensive.
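If it helps, here is a minimal sketch of that broadcasting step in PyTorch. The variable names and shapes are illustrative only and do not match the actual per-edge tensor layout in the repository:

```python
import torch

F = 64                       # number of feature channels (illustrative)
s_ij_2 = torch.randn(F)      # scalar filter values for one edge, shape (F,)
r_ij = torch.randn(3)        # direction vector for the same edge, shape (3,)
r_ij = r_ij / r_ij.norm()    # normalize the direction

# Insert the missing dimensions so the shapes line up:
#   s_ij_2: (F,) -> (1, F)
#   r_ij:   (3,) -> (3, 1)
# The element-wise product then broadcasts (repeats) both tensors
# to the common shape (3, F), i.e. one 3D vector per feature channel.
directional_message = r_ij.unsqueeze(1) * s_ij_2.unsqueeze(0)  # shape (3, F)
```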
scalar product
Hope this helps, feel free to ask if you have further questions.
Hello, after reading the paper I have several questions regarding your approach. Thanks a lot in advance for taking the time to answer them.
Your embedding layer is more complex than usual: the initial node representation already seems to depend on its neighbours' representations.
Graph construction: you use a smooth cutoff function and describe some of its benefits. You describe a Transformer, but still use a cutoff value.
You say the feature vectors are passed through a normalization layer.
An intermediate node embedding (y_i) utilising attention scores is created and impacts the final x_i and v_i embeddings. This step weights a projection of each neighbor's representation, ~ $a_{ij} (W \cdot \mathrm{RBF}(d_{ij}) \cdot \vec{V}_j)$, by the attention score.
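Just to check my understanding, in PyTorch-style pseudocode I read this step as something like the following (names and shapes are my own assumptions, for a single node i and its neighbours j):

```python
import torch

F, num_neighbours = 64, 8                   # illustrative sizes
a_ij = torch.softmax(torch.randn(num_neighbours), dim=0)  # attention scores over neighbours j
V_j = torch.randn(num_neighbours, F)        # projected neighbour representations
w_rbf_ij = torch.randn(num_neighbours, F)   # W * RBF(d_ij), projected to F channels

# y_i = sum_j a_ij * (W * RBF(d_ij) * V_j), i.e. an attention-weighted
# sum over neighbours of the filtered neighbour representations.
y_i = (a_ij.unsqueeze(1) * w_rbf_ij * V_j).sum(dim=0)      # shape (F,)
```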
The equivariant message m_ij (a component of the sum that yields w_i) is obtained by multiplying s_ij^2 (i.e. v_j scaled by RBF(d_ij)) by the directional information r_ij, and then adding s_ij^1 (i.e. v_j scaled by RBF(d_ij)) multiplied again element-wise by v_j. (I sketch my reading of this step in pseudocode after these questions.)
Do you think that multiplying the message sequentially by distance information and directional information is the best choice to embed both types of information? Why not, for instance, concatenate the r_ij (= r_i - r_j) and d_ij (= ||r_ij||, the distance) information and use a single operation?
Is multiplying s_ij^1 by v_j (again) necessary? (v_j already enters s_ij, and s_ij^1 is then multiplied element-wise with v_j.)
IMPORTANT. r_ij has dimension 3 while s_ij^2 has dimension F. In Eq. (11), how can you apply an element-wise multiplication? Is it a typo? How exactly do you combine these two quantities? What is your take on the best way to combine 3D information (a directional vector) with an existing embedding? This is a genuine question I am interested in, if you have references or insights on this…
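For reference, here is how I currently read the vector message and its aggregation, in PyTorch-style pseudocode. The variable names and shapes are my own assumptions, and the broadcast of the direction over the feature channels is exactly the shape question I ask above:

```python
import torch

F = 64                                   # feature channels (my assumption)
s_ij_1 = torch.randn(F)                  # filter values for one edge (i, j)
s_ij_2 = torch.randn(F)
v_j = torch.randn(3, F)                  # equivariant features of neighbour j
r_ij = torch.randn(3)
r_ij = r_ij / r_ij.norm()                # unit vector along r_i - r_j

# First term rescales the neighbour's equivariant features per channel;
# second term injects the edge direction, broadcast over the F channels.
m_ij = s_ij_1 * v_j + s_ij_2 * r_ij.unsqueeze(1)   # shape (3, F)

# w_i would then be the sum of m_ij over all neighbours j of node i.
```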
The invariant representation involves the scalar product of the equivariant vector v_i projected with matrix U_1, with (U_2 v_i), i.e. $\langle U_1 \vec{v}_i, U_2 \vec{v}_i \rangle$.
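As I understand it, the scalar product is what makes this quantity invariant. A quick sketch of my reasoning, assuming $U_1$ and $U_2$ act on the feature dimension while a rotation $R$ acts on the three spatial components (so the two commute):

$$\langle U_1 R \vec{v}_i,\; U_2 R \vec{v}_i \rangle = \langle R\, U_1 \vec{v}_i,\; R\, U_2 \vec{v}_i \rangle = \langle U_1 \vec{v}_i,\; U_2 \vec{v}_i \rangle,$$

since $R^\top R = I$, so the resulting scalar does not change under rotations of the input.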