microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI
MIT License

[RetNet] Equation in the paper #1366

Closed kenkenpa2126 closed 12 months ago

kenkenpa2126 commented 12 months ago

In the RetNet paper, equation (3) is simplified by making $\gamma$ a scalar, leading to the derivation of the following:

$$ o_n = \sum_{m=1}^{n} \gamma^{n-m} (Q_n e^{in \theta}) ( K_m e^{im \theta} )^{\dagger} v_m \tag{4} $$

However, I find this peculiar because $\theta \in \mathbb{R}^d$, so $e^{i n \theta}$ has shape $1 \times d$. Since $Q_n \in \mathbb{R}^{1 \times d}$ as well, the matrix product $Q_n e^{in \theta}$ is not well-defined.

Before $\gamma$ was reduced to a scalar, it was a vector with the same shape as $\theta$, i.e. $\gamma, \theta \in \mathbb{R}^d$. I understand $\gamma$ and $\theta$ are represented as $\gamma = \left[ \gamma_1, \ldots, \gamma_d \right]^T$ and $\theta = [\theta_1, \ldots, \theta_d ]$, respectively. Then $\gamma e^{i \theta}$ is somehow calculated as follows (I don't know why the components other than the diagonal become 0):

$$ \gamma e^{i \theta} = \begin{pmatrix} \gamma_1 e^{i \theta_1} & 0 & \cdots & 0 \\ 0 & \gamma_2 e^{i \theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_d e^{i \theta_d} \end{pmatrix} $$

However, if we reduce $\gamma$ to a scalar, we must redefine $\theta$ as a diagonal matrix $\theta^{\prime} \in \mathbb{R}^{d \times d}$ as follows:

$$ \theta^{\prime} = \begin{pmatrix} \theta_1 & 0 & \cdots & 0 \\ 0 & \theta_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \theta_d \end{pmatrix} $$

Then, $\gamma e^{i n \theta^{\prime}}$ maintains the same shape and is described as below:

$$ \gamma e^{i n \theta^{\prime}} = \begin{pmatrix} \gamma e^{i n \theta_1} & 0 & \cdots & 0 \\ 0 & \gamma e^{i n \theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma e^{i n \theta_d} \end{pmatrix} $$

If we denote $Q_n = [ q_1, \ldots, q_d ]$, then we can express $Q_n e^{i n \theta^{\prime}}$ with the original $\theta \in \mathbb{R}^{d}$, without using the newly redefined $\theta^{\prime} \in \mathbb{R}^{d \times d}$:

$$ Q_n e^{i n \theta^{\prime}} = [ q_1, \ldots, q_d ] \begin{pmatrix} e^{i n \theta_1} & 0 & \cdots & 0 \\ 0 & e^{i n \theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{i n \theta_d} \end{pmatrix} = [ q_1 e^{i n \theta_1} , \ldots, q_d e^{i n \theta_d} ] = Q_n \odot [ e^{i n \theta_1} , \ldots, e^{i n \theta_d} ] = Q_n \odot e^{i n \theta} $$
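This identity (right-multiplying a row vector by a diagonal phase matrix equals the Hadamard product with the phase vector) can be checked numerically. A minimal NumPy sketch; the sizes and random values are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical toy example: dimension d = 4, position n = 3.
d, n = 4, 3
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, d)   # theta in R^d
Q_n = rng.standard_normal(d)           # row vector Q_n in R^{1 x d}

phases = np.exp(1j * n * theta)        # e^{in theta} as a vector in C^d

# Right-multiplying by the diagonal matrix diag(e^{in theta_1}, ..., e^{in theta_d}) ...
via_diag = Q_n @ np.diag(phases)
# ... gives the same result as the elementwise (Hadamard) product Q_n ⊙ e^{in theta}.
via_hadamard = Q_n * phases

assert np.allclose(via_diag, via_hadamard)
```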

Therefore, I personally believe that equation (4) should be expressed as follows:

$$ o_n = \sum_{m=1}^{n} \gamma^{n-m} (Q_n \odot e^{in \theta}) ( K_m \odot e^{im \theta} )^{\dagger} v_m \tag{4} $$
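Under this elementwise reading, the conjugate transpose still makes the two phase factors combine into a relative-position rotation $e^{i(n-m)\theta}$, which is the point of the encoding. A small NumPy check (toy values; since $K_m$ is real here, conjugating flips the sign of its phase):

```python
import numpy as np

# Hypothetical toy positions n = 5, m = 2 with dimension d = 4.
d, n, m = 4, 5, 2
rng = np.random.default_rng(1)
theta = rng.uniform(0, 2 * np.pi, d)
Q_n = rng.standard_normal(d)
K_m = rng.standard_normal(d)

# (Q_n ⊙ e^{in theta}) (K_m ⊙ e^{im theta})^dagger, the elementwise reading:
lhs = (Q_n * np.exp(1j * n * theta)) @ np.conj(K_m * np.exp(1j * m * theta))

# The phases combine into a single relative-position rotation e^{i(n-m) theta}:
rhs = (Q_n * np.exp(1j * (n - m) * theta)) @ K_m

assert np.allclose(lhs, rhs)
```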

In addition, I'd like to ask why the components other than the diagonal become 0 in $\gamma e^{i \theta}$, when $\gamma, \theta \in \mathbb{R}^d$.

kenkenpa2126 commented 12 months ago

On further reflection, I understand that $\gamma e^{i \theta}$ represents the matrix whose diagonal elements are $\gamma_j e^{i \theta_j}$, instead of representing the matrix product of $\gamma \in \mathbb{R}^d$ and $e^{i \theta} \in \mathbb{C}^d$.

$$ \gamma e^{i \theta} = \begin{pmatrix} \gamma_1 e^{i \theta_1} & 0 & \cdots & 0 \\ 0 & \gamma_2 e^{i \theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \gamma_d e^{i \theta_d} \end{pmatrix} $$

After $\gamma$ is reduced to a scalar, $e^{i \theta}$ is described as below:

$$ e^{i \theta} = \begin{pmatrix} e^{i \theta_1} & 0 & \cdots & 0 \\ 0 & e^{i \theta_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e^{i \theta_d} \end{pmatrix} $$

If so, equation (4) in the paper is well-defined.

I believe this aligns with the paper's intent, but it's confusing that $e^{i \theta}$ denotes a $\mathbb{C}^{d \times d}$ matrix even though $\theta \in \mathbb{R}^{d}$. It should be clarified that $e^{i \theta}$ represents the diagonal matrix in $\mathbb{C}^{d \times d}$ whose diagonal elements are $e^{i \theta_j}$, derived from $\theta \in \mathbb{R}^{d}$.

sunyt32 commented 12 months ago

Yeah, you are right about the shape of $\theta$. It's better written as: $$e^{i\theta}=\mathbf{diag}(e^{i\theta_1}, e^{i\theta_2}, \ldots, e^{i\theta_d})$$
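With $e^{i\theta}$ read as this diagonal matrix, the two interpretations of equation (4) agree. A toy NumPy check (hypothetical sizes, scalar-valued $v_m$ for simplicity, 0-indexed positions):

```python
import numpy as np

# Hypothetical toy sizes: dimension d = 4, sequence length N = 6.
d, N = 4, 6
gamma = 0.9
rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, d)
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
v = rng.standard_normal(N)

# e^{i theta} read as the diagonal matrix diag(e^{i theta_1}, ..., e^{i theta_d}).
D = np.diag(np.exp(1j * theta))

n = N - 1  # last position
# Eq. (4) with e^{in theta} as a diagonal matrix power D^n (the paper's intent):
o_diag = sum(gamma**(n - m)
             * (Q[n] @ np.linalg.matrix_power(D, n))
             @ np.conj(K[m] @ np.linalg.matrix_power(D, m))
             * v[m]
             for m in range(n + 1))
# The same sum with the elementwise (Hadamard) reading:
o_had = sum(gamma**(n - m)
            * (Q[n] * np.exp(1j * n * theta))
            @ np.conj(K[m] * np.exp(1j * m * theta))
            * v[m]
            for m in range(n + 1))

assert np.allclose(o_diag, o_had)
```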