Open jameswu2014 opened 1 month ago
F is a nonlinear function, so why are they equivalent?
Hi @jameswu2014, thank you so much for your interest in our work. The key point is that the matrix $\Lambda$ is diagonal. In modern large language models (LLMs), the commonly used nonlinear activation is SwiGLU, whose gating is an element-wise multiplication, so $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$ holds in this context. Here $\mathbf{W}$ is the up_proj weight, $F(\mathbf{V}) = \mathbf{G} \odot \mathbf{V}$, and $\mathbf{G}$ is the output of gate_proj. Concretely, $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = \mathbf{G} \odot (\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = (\mathbf{G} \odot (\mathbf{X}\mathbf{W}^T)) \mathbf{\Lambda} = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$, since right-multiplying by a diagonal matrix only scales columns, which commutes with element-wise multiplication.
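The identity above is easy to check numerically. Below is a minimal sketch with random matrices (all shapes and names are hypothetical, chosen just for illustration): the gate output `G` is computed with Swish as in SwiGLU, and multiplying by the diagonal `Lambda` before or after the gating gives the same result.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))    # input activations (hypothetical shape)
W = rng.standard_normal((16, 8))   # up_proj weights
Wg = rng.standard_normal((16, 8))  # gate_proj weights (hypothetical)
lam = rng.standard_normal(16)      # diagonal entries of Lambda

def swish(z):
    # Swish / SiLU activation used inside SwiGLU
    return z / (1.0 + np.exp(-z))

G = swish(X @ Wg.T)                # gate_proj output, F(V) = G ⊙ V

# F(X W^T Λ): scale columns by Λ first, then apply the gate
lhs = G * ((X @ W.T) * lam)
# F(X W^T) Λ: apply the gate first, then scale columns by Λ
rhs = (G * (X @ W.T)) * lam

print(np.allclose(lhs, rhs))       # the two orders agree
```

Because `Λ` is diagonal, `(X @ W.T) * lam` is just a per-column rescaling, and the element-wise product with `G` commutes with it; this would fail for a general (non-diagonal) matrix.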
Got it. First of all, thank you for your reply. But I still have several questions about it.
Hi @jameswu2014. For your questions: `gate_proj` is not rotated.