mlpen / Nystromformer

Apache License 2.0

Some questions regarding the paper #10

Closed wzm2256 closed 3 years ago

wzm2256 commented 3 years ago

Hi, thanks for sharing the code.

I read your paper recently and have some questions about the details. I would be very grateful if you could help me with them.

  1. How do you derive Eq. (4)? As far as I know, the Nyström method only applies to semidefinite matrices, not to a product of matrices, let alone the softmax of a matrix product. I did read through the reference cited above Eq. (4), "Improving CUR Matrix Decomposition and the Nyström Approximation via Adaptive Sampling", but I failed to find any relevant results. Could you point me to where I can find them?

  2. The other question is about the key simplification, which first applies the Nyström method inside the softmax and then commutes the matrix product with the softmax operation (computing the softmax first and then the matrix product). While the first step (Nyström) is indeed unbiased, the second step (commuting the operations) is not necessarily exact. In summary, it is hard to say what the final Equation (13) really approximates, and why it works at all.
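To make the concern in point 2 concrete, here is a small numpy sketch (the landmark choice and shapes are arbitrary illustrations, not the paper's sampling scheme) comparing "Nyström inside the softmax" against the commuted form where each factor is softmaxed first and then multiplied:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with max-subtraction for numerical stability.
    Z = Z - Z.max(axis=-1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n, m, d = 16, 4, 8                      # sequence length, landmarks, head dim
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
Qt, Kt = Q[:m], K[:m]                   # hypothetical landmarks: first m rows

# (a) Nystrom applied to the logits, softmax taken afterwards.
logits = Q @ Kt.T @ np.linalg.pinv(Qt @ Kt.T) @ Qt @ K.T / np.sqrt(d)
inside = softmax(logits)

# (b) Commuted form: softmax each factor, then multiply (Eq. 13 style).
outside = (softmax(Q @ Kt.T / np.sqrt(d))
           @ np.linalg.pinv(softmax(Qt @ Kt.T / np.sqrt(d)))
           @ softmax(Qt @ K.T / np.sqrt(d)))

gap = np.abs(inside - outside).max()    # nonzero in general
```

With random inputs the two forms differ, which is exactly why the question asks what quantity Equation (13) actually approximates.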

yyxiongzju commented 3 years ago

@wzm2256, Thanks for your interest in this work.

1. No, the Nyström method can be applied to general matrices, not just semidefinite ones. Eq. 1 in the reference is stated for an SPSD matrix, but that does not mean the method can only be used for SPSD matrices. And it is not hard to derive: you can follow the original idea of the Nyström method to derive Eq. (4).
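As a quick sanity check on this point, the CUR/Nyström-style reconstruction A ≈ C W⁺ R works for a general (non-symmetric) matrix; it is exact whenever A has rank r and the intersection block W also has rank r. A minimal numpy sketch (the landmark indices here are an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a general rank-r matrix A = X @ Y (n x n, not symmetric).
n, r = 50, 5
X = rng.standard_normal((n, r))
Y = rng.standard_normal((r, n))
A = X @ Y

# Landmark index set: the first r rows/columns (any index set whose
# intersection block has rank r works almost surely for random data).
idx = np.arange(r)
C = A[:, idx]                    # sampled columns
R = A[idx, :]                    # sampled rows
W = A[np.ix_(idx, idx)]          # intersection block

# Nystrom/CUR reconstruction with the Moore-Penrose pseudoinverse.
A_hat = C @ np.linalg.pinv(W) @ R
rel_err = np.linalg.norm(A - A_hat) / np.linalg.norm(A)
```

Here `rel_err` is at machine-precision level, showing no symmetry or positive-definiteness is needed for the reconstruction itself.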

2. First, it is easy to see that Equation (13) recovers the true softmax matrix in self-attention exactly when all n landmarks are used. Second, the error between the approximate softmax matrix and the true softmax matrix is bounded, as shown in the supplement.