mlpen / Nystromformer

Apache License 2.0

Some questions regarding the paper #10

Closed wzm2256 closed 3 years ago

wzm2256 commented 3 years ago

Hi, thanks for sharing the code.

I read your paper recently and have some questions about the details. I would be very grateful if you could help me with them.

  1. How do you derive Eq. (4)? As far as I know, the Nyström method only applies to semidefinite matrices, not to a product of matrices, let alone the softmax of a matrix product. I did read through the reference cited above Eq. (4), "Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling", but I failed to find any relevant results. Could you point me to where I can find them?

  2. The other question is about the key simplification, which first applies the Nyström method inside the softmax and then commutes the matrix product with the softmax operation (computing the softmax first and then the matrix product). While the first step (Nyström) is indeed unbiased, the second step (commuting the operations) is not necessarily exact. In summary, it is hard to say what the final Equation (13) really approximates, and why it works at all.
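To make the concern in point 2 concrete, here is a small numpy sketch (the landmark choice and shapes are arbitrary illustrations, not the paper's sampling scheme) comparing "Nyström inside the softmax" against the commuted form where each factor is softmaxed first and then multiplied:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with max-subtraction for numerical stability.
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, m, d = 16, 4, 8                      # sequence length, landmarks, head dim
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
Qt, Kt = Q[:m], K[:m]                   # hypothetical landmarks: first m rows

# (a) Nystrom applied to the logits, softmax taken afterwards.
logits = Q @ Kt.T @ np.linalg.pinv(Qt @ Kt.T) @ Qt @ K.T / np.sqrt(d)
inside = softmax(logits)

# (b) Commuted form: softmax each factor, then multiply (Eq. 13 style).
outside = (softmax(Q @ Kt.T / np.sqrt(d))
           @ np.linalg.pinv(softmax(Qt @ Kt.T / np.sqrt(d)))
           @ softmax(Qt @ K.T / np.sqrt(d)))

gap = np.abs(inside - outside).max()    # nonzero in general
```

With random inputs the two forms differ, which is exactly why the question asks what quantity Equation (13) actually approximates.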

yyxiongzju commented 3 years ago

@wzm2256, Thanks for your interest in this work.

1. No, the Nyström method can be applied to general matrices, not just semidefinite ones. Eq. 1 in the reference is stated for an SPSD matrix, but that does not mean the method can only be used for SPSD matrices. And it is not hard to derive: you can follow the original idea of the Nyström method to derive Eq. (4).
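As a quick sanity check on this point, the CUR/Nyström-style reconstruction A ≈ C W⁺ R works for a general (non-symmetric) matrix; it is exact whenever A has rank r and the intersection block W also has rank r. A minimal numpy sketch (the landmark indices here are an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a general rank-r matrix A = X @ Y (n x n, not symmetric).
n, r = 50, 5
X = rng.standard_normal((n, r))
Y = rng.standard_normal((r, n))
A = X @ Y

# Landmark index set: the first r rows/columns (any index set whose
# intersection block has rank r works almost surely for random data).
idx = np.arange(r)
C = A[:, idx]                    # sampled columns
R = A[idx, :]                    # sampled rows
W = A[np.ix_(idx, idx)]          # intersection block

# Nystrom/CUR reconstruction with the Moore-Penrose pseudoinverse.
A_hat = C @ np.linalg.pinv(W) @ R
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

Here `rel_err` is at machine-precision level, showing no symmetry or positive-definiteness is needed for the reconstruction itself.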

2. First, it is easy to see that Equation (13) recovers the true softmax matrix in self-attention exactly when all n landmarks are used. Second, the error between the approximate softmax matrix and the true softmax matrix is bounded, as shown in the supplement.