mfinzi / equivariant-MLP

A library for programmatically generating equivariant layers through constraint solving
MIT License

Bilinear layer randomness #21

Open tylee-fdcl opened 1 year ago

tylee-fdcl commented 1 year ago

First, thanks for the great work! This is really helpful in several ways.

While playing with your code, I encountered random behavior in emlp and figured out that it is caused by the bilinear layer. I wish I had checked issue #8, "Saving and Loading Objax EMLPs yields slightly different predictions," before trying to identify it myself. Two things:

  1. It was suggested to use the same numpy random seed as a workaround. But I'm wondering if there is another way to resolve this, such as saving and loading additional parameters from the bilinear layer.
  2. In fact, the only part of your paper and code that is unclear to me is the bilinear layer. I do not understand why there is randomness in it, if it is presumably computing something like $x^T A x + b x + c$ with projections. It would be really helpful if the mathematical expression for your bilinear layer were provided. Thanks.
mfinzi commented 1 year ago

Hi @tylee-fdcl, sorry for responding so late! 1) That should be possible, though the layer code may need to be changed (to add a seed which gets saved) for that to happen. Happy to review pull requests, but unfortunately I have very little bandwidth at the moment.
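For reference, here is a minimal sketch of the seed workaround mentioned in #8. It assumes the bilinear layer's random pairing is drawn from numpy's global RNG at construction time, so seeding numpy identically before building and before reloading keeps the architecture consistent with the saved weights. The reps, group, and sizes (`T(1)`, `T(0)`, `SO(3)`, `ch=128`) are just illustrative choices.

```python
import numpy as np
import objax
from emlp.nn import EMLP
from emlp.reps import T
from emlp.groups import SO

def build_model(seed=0):
    # Seed numpy *before* construction so the bilinear layer's random
    # choice of weight pairings is reproducible (assumption: that choice
    # is drawn from numpy's global RNG at build time, per the workaround).
    np.random.seed(seed)
    return EMLP(T(1), T(0), group=SO(3), num_layers=3, ch=128)

model = build_model(seed=0)
# ... train ...
objax.io.save_var_collection('emlp_weights.npz', model.vars())

# Later: rebuild with the same seed so the randomly chosen pairings match
# the saved weights, then load the variables back in.
model2 = build_model(seed=0)
objax.io.load_var_collection('emlp_weights.npz', model2.vars())
```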

2) The bilinear layer is only described very briefly in the paper, so I can understand the confusion. The bilinear layer is a little more complicated, since $x^T A x + b x + c$ would map from $V \to \mathbb{R}$ (if $x \in V$), whereas our bilinear layer maps from $V_{\text{in}} \to V_{\text{out}}$. Also, it does not include any linear or constant terms, so it is purely a bilinear map: $V_{\text{in}} \times V_{\text{in}} \to V_{\text{out}}$. The mapping is somewhat difficult to describe just with matrices, but it essentially takes the parts of $x$ (given by the multiplicities of the different representations making up $V_{\text{in}}$) whose type can be interpreted as a map from other parts of $x$ to parts of the output vector space $V_{\text{out}}$; each of these pairings gets a separate scalar weight, which is a learnable parameter. Where the randomness comes in is that the quadratic pairing is too large to have a weight for every pair, so we limit the number of weights to be at most the size of the representation (I would need to look to remember the exact size) and randomly choose which weights to keep to satisfy this constraint.
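To illustrate just the random part: a toy sketch with a hypothetical helper (not the library's actual code), showing how capping the number of kept pairings and choosing them at random makes two constructions with different RNG states produce different layers.

```python
import numpy as np

def choose_bilinear_pairs(candidate_pairs, max_weights, rng=np.random):
    """Toy sketch (hypothetical helper, not the library's API).

    candidate_pairs: triples (i, j, k) meaning "part i of x, read as a map,
    sends part j of x to part k of the output".  If there are more candidates
    than max_weights (e.g. the size of the representation), keep only a
    random subset -- this random choice is the source of the run-to-run
    variation discussed in this issue.
    """
    if len(candidate_pairs) <= max_weights:
        return list(candidate_pairs)
    keep = rng.choice(len(candidate_pairs), size=max_weights, replace=False)
    return [candidate_pairs[i] for i in sorted(keep)]
```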

An example would be, say, $V_{\text{in}} = T_1 \oplus T_2 \oplus T_3$ and $V_{\text{out}} = V_{\text{in}}$ (assuming the representation is orthogonal). In this example, we would have a weight using the $T_2$ of $V_{\text{in}}$ to map the $T_1$ of $V_{\text{in}}$ to the $T_1$ of $V_{\text{out}}$, a weight for using the $T_3$ of $V_{\text{in}}$ to map the $T_1$ to the $T_2$ in the output, and a separate weight to use the same $T_3$ to map the $T_2$ to the $T_1$ in the output. There would also be a $T_2$ mapping $T_1$ to $T_1$, a $T_2$ mapping $T_2$ to $T_2$, and a $T_2$ mapping $T_3$ to $T_1$. This way, things like inner products, matrix-vector products, matrix-matrix multiplies, and higher-order tensor contractions can be represented.
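A purely illustrative picture of some of those pairings, taking the base space to be $\mathbb{R}^3$ and writing each map as an einsum contraction (this is not how the library implements the layer; it just shows the kind of map each scalar weight would scale):

```python
import numpy as np

d = 3
v = np.random.randn(d)        # the T_1 piece of x
M = np.random.randn(d, d)     # the T_2 piece of x
W = np.random.randn(d, d, d)  # the T_3 piece of x

# Each line is one of the pairings described above; in the bilinear layer
# each such contraction would get its own learnable scalar weight.
T1_from_T2_on_T1 = np.einsum('ij,j->i', M, v)     # T_2 acting on T_1 -> T_1 (matrix-vector)
T2_from_T3_on_T1 = np.einsum('ijk,k->ij', W, v)   # T_3 acting on T_1 -> T_2
T1_from_T3_on_T2 = np.einsum('ijk,jk->i', W, M)   # T_3 acting on T_2 -> T_1
T2_from_T2_on_T2 = np.einsum('ij,jk->ik', M, M)   # T_2 acting on T_2 -> T_2 (matrix-matrix)
```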