scaomath / galerkin-transformer

[NeurIPS 2021] Galerkin Transformer: a linear attention without softmax for Partial Differential Equations
MIT License

How does parameter initialization influence performance? #2

Closed. WangChen100 closed this issue 2 years ago.

WangChen100 commented 2 years ago

Hi Cao, I noticed the parameter initialization in your code:

def _reset_parameters(self):
        for param in self.linears.parameters():
            if param.ndim > 1:
                # weight matrices: Xavier uniform with a custom (small) gain
                xavier_uniform_(param, gain=self.xavier_init)
                if self.diagonal_weight > 0.0:
                    # add a scaled identity so the projection starts near-diagonal
                    param.data += self.diagonal_weight * \
                        torch.diag(torch.ones(
                            param.size(-1), dtype=torch.float))
                if self.symmetric_init:
                    # optionally symmetrize the initial weight
                    param.data += param.data.T
                    # param.data /= 2.0
            else:
                # biases: initialized to zero
                constant_(param, 0)

Does this initialization influence performance much, and why do you initialize the linear layers this way? Thank you very much!

scaomath commented 2 years ago

@WangChen100 Please check Table 9 on page 23 of https://arxiv.org/pdf/2105.14995.pdf

I have not reported the effect of symmetric initialization, as I was getting inconsistent results across different examples, which suggests it is problem dependent.

As for the other two, dialing down the scale of the projection matrices has two advantages: (1) numerical stability, and (2) compatibility with a Neural-ODE-like integrator scheme. For details, please refer to the answers to the reviewers at https://openreview.net/forum?id=ssohLcmn4-r
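
To make the near-identity effect concrete, here is a minimal sketch of the idea (not the repository's exact module; the gain, delta, and dimension values below are placeholder assumptions, not the repo's defaults). A small Xavier gain plus a scaled diagonal makes the initial projection behave like a small multiple of the identity, so each projection starts out as the input plus a small, well-conditioned perturbation, much like one small explicit step of an ODE integrator:

import torch
from torch.nn.init import xavier_uniform_

d = 64                         # feature dimension (placeholder)
gain, delta = 1e-2, 1e-2       # placeholder values, not the repo's defaults

W = torch.empty(d, d)
xavier_uniform_(W, gain=gain)  # small random component
W += delta * torch.eye(d)      # scaled identity: near-diagonal start

x = torch.randn(8, d)          # a batch of feature vectors
y = x @ W

# At initialization, y is approximately delta * x, i.e. the projection is a
# gentle perturbation of a scaled identity map.
print(torch.linalg.norm(y - delta * x) / torch.linalg.norm(delta * x))

This only illustrates why shrinking the projection scale helps conditioning; the actual hyperparameters are set per example in the repo.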

Similar tricks have been discovered in https://arxiv.org/abs/2108.12284