mit-han-lab / qserve

QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
Apache License 2.0

Question about the paper #10

Open jameswu2014 opened 1 month ago

jameswu2014 commented 1 month ago

[screenshot: 20240518-124213, showing the equivalence $F(\mathbf{X}\mathbf{W}^T\Lambda) = F(\mathbf{X}\mathbf{W}^T)\Lambda$ from the paper]

$F$ is a nonlinear function, so why are the two sides equivalent?

synxlin commented 1 month ago

Hi @jameswu2014, thank you so much for your interest in our work. The matrix $\Lambda$ is diagonal, and in modern large language models (LLMs) the commonly used nonlinear activation is SwiGLU, whose nonlinearity enters only through an element-wise product. Since right-multiplying by a diagonal $\Lambda$ merely rescales each column, and column-wise scaling commutes with the element-wise product $\odot$, $F(\mathbf{X}\mathbf{W}^T\Lambda) = F(\mathbf{X}\mathbf{W}^T)\Lambda$ holds in this context. Here $\mathbf{W}$ is the up_proj weight, $F(\mathbf{V}) = \mathbf{G} \odot \mathbf{V}$, and $\mathbf{G}$ is the output of gate_proj, i.e., $F(\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = \mathbf{G} \odot (\mathbf{X}\mathbf{W}^T\mathbf{\Lambda}) = (\mathbf{G} \odot (\mathbf{X}\mathbf{W}^T))\,\mathbf{\Lambda} = F(\mathbf{X}\mathbf{W}^T)\mathbf{\Lambda}$.
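
If it helps, here is a quick numeric sanity check of this identity (a minimal PyTorch sketch, not QServe code; the shapes and variable names are made up for illustration):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_in, d_hidden, n_tokens = 16, 32, 4
X = torch.randn(n_tokens, d_in)        # activations (hypothetical shapes)
W_up = torch.randn(d_hidden, d_in)     # up_proj weights
W_gate = torch.randn(d_hidden, d_in)   # gate_proj weights
lam = torch.rand(d_hidden) + 0.5       # diagonal of Lambda (per-channel scales)

G = F.silu(X @ W_gate.T)   # gate branch
V = X @ W_up.T             # up branch

lhs = G * (V * lam)   # F(X W^T Lambda): scale the up branch before gating
rhs = (G * V) * lam   # F(X W^T) Lambda: scale the gated output afterwards

# True (up to floating-point rounding): element-wise gating
# commutes with diagonal column scaling.
print(torch.allclose(lhs, rhs, atol=1e-6))
```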

jameswu2014 commented 1 month ago

Got it. First of all, thank you for your reply, but I still have a couple of questions:

  1. Do you mean that $\mathbf{G}$ (the output of gate_proj) does not need to be rotated, or that it is rotated online?
  2. The SiLU op comes after gate_proj, so $\mathbf{G}$'s precision is FP16, and SiLU's output is also FP16. Is the gate_proj GEMM then INT4×INT8 → FP16? Is that right?

synxlin commented 1 month ago

Hi @jameswu2014. To answer your questions: