twistedcubic / attention-rank-collapse

[ICML 2021 Oral] We show pure attention suffers rank collapse, and how different mechanisms combat it.
Apache License 2.0
153 stars 10 forks source link

Maths comprehension problem #1

Open audibert-alexandre-fra opened 1 year ago

audibert-alexandre-fra commented 1 year ago

I had a question about the math part of the paper, even though my math level isn't very high. I don't understand the inequality on the left in the "Technical lemma" section. Why is the lower bound 1 - min(E_ij - Eij') and not 1 + min ( min(E_ij - Eij')). Thanks in advance.

twistedcubic commented 1 year ago

Hello @audibert-alexandre-fra, for the lower bound on $P{ij}$, we are taking the $min$ over $j'$, hence there is only one $min$ on the left hand side. We skipped some steps that we will add soon, the lower bound in the lemma indeed uses $1 + \min{j'} (E{ij} - E{ij'}) < \min{j'} \exp(E{ij} - E{ij'})$. When $E{ij} - E{ij'}\ge 0$, $1-\min{j'} (E{ij} - E{ij'}) \le 1+\min{j'} (E{ij} - E{ij'})$, hence the lower bound in the lemma follows. And when $E{ij} - E{ij'}<0$, since $\min{j'} (E{ij} - E{ij'}) \le \max( \delta_i^T E(\deltaj - \delta{j'}) )$, $1 - \max( \delta_i^T E(\deltaj - \delta{j'}) ) \le 1+\min{j'} (E{ij} - E_{ij'})$, hence the lower bound follows in this case as well.

In addition, for the upper bound on $P{ij}$ in this lemma, we have a slightly updated lemma and proof that uses an assumption that $\max{j'} |E{ij} - E{ij'}| < 1.256$. The updated argument divides the bound into cases depending on the sign of $E{ij} - E{ij'}$. We will post this updated version soon.

Best, Yihe