microsoft / Swin3D

A shift-window based transformer for 3D sparse tasks
MIT License

Question about Memory-efficient self-attention #24

Open LAB123-tech opened 8 months ago

LAB123-tech commented 8 months ago

Hi, this is nice work on applying the Swin Transformer to point clouds. However, I don't really understand the Memory-efficient self-attention part.

$$f_{i,h}^{*}=\frac{\sum_{j=1}^{N}\exp(e_{ij,h})\,f_{j}W_{V,h}}{\sum_{j=1}^{N}\exp(e_{ij,h})}\qquad(3)$$

How should I understand the idea that this formulation allows the SoftMax normalization to be postponed, so that the weights $\alpha_{ij,h}$ never have to be constructed and stored explicitly?
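To make my confusion concrete, here is how I currently read it (assuming the standard per-head softmax weights):

$$
\alpha_{ij,h}=\frac{\exp(e_{ij,h})}{\sum_{k=1}^{N}\exp(e_{ik,h})},\qquad
f_{i,h}^{*}=\sum_{j=1}^{N}\alpha_{ij,h}\,f_{j}W_{V,h}
=\frac{\sum_{j=1}^{N}\exp(e_{ij,h})\,f_{j}W_{V,h}}{\sum_{j=1}^{N}\exp(e_{ij,h})}.
$$

Is the point simply that the division by the normalizer can be moved outside the sum over $j$ and applied once at the end, so no individual $\alpha_{ij,h}$ ever needs to be stored?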

I also find it hard to fully understand how the numerator and denominator of Eq. (3) are calculated simultaneously.
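To check my understanding of this point, here is a small PyTorch sketch of what I think "computing the numerator and denominator simultaneously" means: both sums are accumulated over chunks of $j$, and the division happens only at the end, so the full $N\times N$ matrix of $\alpha_{ij,h}$ is never stored. The function names and the chunking scheme are just my own illustration, not the repo's actual CUDA kernel:

```python
import torch


def memory_efficient_attention(e, f, W_V, chunk_size=128):
    """My reading of Eq. (3) for one head: accumulate numerator and denominator
    over chunks of j, so the N x N weight matrix alpha_{ij} is never materialized.

    e:   (N, N) raw attention logits e_{ij}
    f:   (N, C) input features f_j
    W_V: (C, C) value projection for this head
    """
    N = e.shape[0]
    v = f @ W_V                                        # value vectors f_j W_V
    num = torch.zeros(N, v.shape[1], dtype=v.dtype)    # running numerator
    den = torch.zeros(N, 1, dtype=v.dtype)             # running denominator
    for start in range(0, N, chunk_size):
        j = slice(start, start + chunk_size)
        w = torch.exp(e[:, j])                         # unnormalized weights for this chunk
        num += w @ v[j]                                # += sum_j exp(e_ij) f_j W_V
        den += w.sum(dim=1, keepdim=True)              # += sum_j exp(e_ij)
    return num / den                                   # SoftMax normalization applied last


def vanilla_attention(e, f, W_V):
    """Reference version that builds the full alpha matrix explicitly."""
    alpha = torch.softmax(e, dim=1)                    # (N, N) stored in memory
    return alpha @ (f @ W_V)


if __name__ == "__main__":
    torch.manual_seed(0)
    N, C = 300, 16
    e, f, W_V = torch.randn(N, N), torch.randn(N, C), torch.randn(C, C)
    print(torch.allclose(memory_efficient_attention(e, f, W_V),
                         vanilla_attention(e, f, W_V), atol=1e-4))  # expected: True
```

(I realize a real implementation would also compute $e_{ij,h}$ on the fly per window and subtract the per-row maximum before exponentiating for numerical stability; the sketch keeps the full logit matrix only to make the comparison easy.) Is this the intended reading?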

Could you please give me some tips on how to grasp the idea of memory-efficient self-attention?