All credit to Yao Fu (@FranxYao) for the idea.
Just saw this, as I thought of the exact idea yesterday. Minor: the softmax operation at the end isn't required -- just the max of the Gumbel perturbed logits.
> Just saw this, as I thought of the exact idea yesterday. Minor: the softmax operation at the end isn't required -- just the max of the Gumbel perturbed logits.
Good point, and yes we should change this to just the perturbation. I used the Gumbel softmax for a quick test since it was easily available as a pre-defined op, but the softmax is just pointless overhead. In addition, we should skip this altogether if temperature is set to 0 and just fall back to regular argmax.
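Something like this minimal sketch (`sample_next_token` is a hypothetical helper name, not the exact code in the PR):

```python
import torch

def sample_next_token(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    if temperature == 0:
        # Greedy decoding: no noise needed, plain argmax.
        return logits.argmax(dim=-1)
    # Gumbel(0,1) noise via -log(Exponential(1)); the argmax of the
    # temperature-scaled logits plus noise samples from softmax(logits / T).
    gumbels = -torch.empty_like(logits).exponential_().log()
    return (logits / temperature + gumbels).argmax(dim=-1)
```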
Users can now configure temperature in the generate function. @Viol2000 let me know what you think. I also updated minimal.py to add an example of sampling. From my quick test it seems to maintain the speedup in the sampling setting too: about 1.46x higher throughput with lade on vs. off.
Hi @knagrecha , thanks for all your efforts in implementing the sampling method!
From my point of view, Gumbel is a brilliant method for sampling, but I wonder whether it can maintain the output sampling distribution exactly as it would be without lookahead decoding. I think we need to do some rejection sampling in the verification branch to keep the output distribution, like speculative decoding does. Sampling in the lookahead branch (i.e., the variable past_tokens) does not seem sufficient to maintain the output distribution.
What's your opinion? Let's discuss!
Hi @Viol2000, thanks for the quick response. Let me get back to you after I think about a proof of equivalence between the two methods (or, if such a proof is not possible, a modification of the implementation that will allow one). I agree that from a research standpoint, maintaining the distribution is important; otherwise we change the semantics of "sampling". Could you share any formalization you have demonstrating that lookahead decoding with n-grams is equivalent in the greedy decoding case? This may help as a starting point.
From a product/OSS tool perspective, I think it is important to have some kind of sampling support in place for lookahead decoding adoption. Anecdotally, most of the real-world LLM deployments I have seen prefer sampling over greedy decoding. I wonder if we can merge this as temporary support for sampling, and note in the repo that "we hope to verify and support mathematically equivalent sampling soon".
I imagine that for a large number of users, the actual functionality of temperature-modified sampling is more critical than exactly maintaining the distribution. But I also see a risk of confusing users if the change is not surfaced properly. What are your thoughts?
@knagrecha, I saw this an hour and a half ago when you pinged the issue on huggingface/transformers.
As mentioned in https://github.com/vllm-project/vllm/issues/1742, I'm still concerned about experimentally verifying that the approximate distributions are of high enough quality (i.e. the KL-divergences between the true and approximate conditional next token distributions are low enough).
I spent the past bit mathing it out:
For the sake of brevity, let $[K]$ represent the integer interval $\{1, \ldots, K\}$.
Suppose $P\sim Cat(K, \mathbf{p}), Q\sim Cat(K, \mathbf{q})$. Let $\mathbf x$ and $\mathbf y$ be the logits of $\mathbf p$ and $\mathbf q$, respectively.
Fix the Gumbel seed $z_k\overset{i.i.d.}{\sim} Gumbel(0,1)$, $k\in [K]$. Let $M = \mathrm{argmax}_{k\in [K]}(x_k+z_k)$ and $L = \mathrm{argmax}_{k\in [K]}(y_k+z_k)$.
$$
\begin{align}
\mathbb P(M=L)&=\sum_{k\in [K]} \mathbb P(M=k)\, \mathbb P(L=k)\\
&=\sum_{k\in [K]}\sigma_k(\mathbf x)\, \sigma_k(\mathbf y)\\
&=\mathbf p \cdot \mathbf q
\end{align}
$$
$$ \begin{align} \mathbf p \cdot \mathbf q\ge\mathbf p \cdot (\frac{1}{e^\epsilon}\mathbf p)\ge\frac{1}{e^\epsilon} \end{align} $$
But since $D(P||Q)=\frac{1}{\log 2}D_{KL}(P||Q)\ge\frac{1}{2\log 2}\lVert\mathbf p-\mathbf q\rVert_1^2$[^1], $\epsilon$ is also lower-bounded accordingly.
I'm not sure if the notation is the best, but I think I got the point across.
[^1]: Also known as Pinsker's inequality. Cover and Thomas (1991, Lemma 12.6.1, pp. 300–301). Also see Lemma 361 in Shalizi (2006).
... And I just realized I spent an hour typing it out. 🤣
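Edit: here's a quick Monte-Carlo sanity check of the $\mathbb P(M=L)=\mathbf p\cdot\mathbf q$ identity, with independent Gumbel draws for each argmax (the shared-noise case is worked out further down the thread):

```python
import torch

torch.manual_seed(0)
K, trials = 5, 200_000
p = torch.softmax(torch.randn(K), dim=0)
q = torch.softmax(torch.randn(K), dim=0)

# Independent Gumbel(0,1) noise for each argmax: G = -log(Exponential(1)).
gM = -(-torch.rand(trials, K).log()).log()
gL = -(-torch.rand(trials, K).log()).log()
M = (p.log() + gM).argmax(dim=1)
L = (q.log() + gL).argmax(dim=1)

print((M == L).float().mean().item())  # empirical P(M = L)
print((p * q).sum().item())            # analytic p · q
```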
For sampling, token equivalence is not necessary for acceptance if the KL-divergence between the true and approximate distributions is small (but it certainly is a lot faster to compute!). I just hope that for high entropy distributions (where the probabilities are closer together), that the token acceptance rate won't drop too much! (Obviously, the empirical frequency of encountering high entropy distributions needs to be considered, especially at common sampling temperatures).
You could say I'm a bit concerned about the tradeoffs between text quality and speed.
But then again, what matters most is whether the text generated from approximate sampling is as preferable as that from true sampling, as judged by a human reader.
Also, regarding rejection sampling... Hopefully there can be some nice concentration inequalities showing that the approximate distribution is "good enough" if you have enough samples?
In general, I'm skeptical whether using lookahead decoding for sampling would be effective.
Using a diffusion language model with adversarial diffusion distillation (to reduce the number of decoding steps to 1) might work, but is still largely unexplored territory.
Hi @shermansiu, thanks for all the great efforts in putting together this mathematical formalization. However, I have some doubts on a few of the steps.
$$\mathbb P(M=L)=\sum_{k\in [K]} \mathbb P(M=k)\, \mathbb P(L=k)$$
In this line, you express the total probability that the argmax of both distributions equals the same value. However, this breakdown of the probability is only possible if M and L are independent, which is unlikely since they both use the same Gumbel noise vectors.
I also assume for the next step to work, that p and q are the softmax-transformed logits (not the raw logits).
Second half
This part also seems problematic to me:
> Since $D_{KL}(P||Q)\le \epsilon$, $\sum\limits_{k\in [K]} p_k\log(\frac{p_k}{q_k}) \le \epsilon$ and as $\mathbf p$ sums to 1, $\log(\frac{p_k}{q_k})\le \epsilon$ for all $k$
Let me give a very simple counterexample where this inference about the summation fails. Ignore whether the values are realistic; it's just a demonstration.
$$p_0 = 0.9, p_1 = 0.1, q_0 = 0.8, q_1 = 0.2.$$ $$\epsilon=0.1$$
$$D_{KL}(p||q) = 0.9 \log(\frac{0.9}{0.8}) + 0.1 \log(\frac{0.1}{0.2}) \approx 0.0367$$
$$\log(\frac{p_0}{q_0})=\log(\frac{0.9}{0.8}) \approx 0.1178$$
$$\log(\frac{p_1}{q_1}) = \log(\frac{0.1}{0.2}) \approx -0.693$$
Notice that $0.1178 > \epsilon$, showing that this summation property does not hold.
These counterexamples do not prove any equivalence of course...only that the final inequality & lower bound derived may not be correct.
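For completeness, the arithmetic checks out in a couple of lines (natural log throughout):

```python
import math

p = [0.9, 0.1]
q = [0.8, 0.2]
eps = 0.1

kl = sum(pk * math.log(pk / qk) for pk, qk in zip(p, q))
print(kl <= eps, kl)                  # True, ~0.0367: the KL bound holds
print(math.log(p[0] / q[0]) <= eps)   # False, ~0.1178: the per-term claim fails
```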
From my perspective, the effectiveness of Gumbel sampling on its own is less of an issue; it has already become a standard approximation option for sampling transformations in prior works [1,2,3]. The point of equivalence between temperature-configured softmax sampling and Gumbel-softmax is important, but at least we can draw on prior examples there.
What I think may be a bigger issue is the interaction between sampling and n-gram selection, rather than single-token selection. This introduces another potentially complicating factor. In other words, the major problem is more specific to lookahead decoding (and its n-gram design) than the general effectiveness of Gumbel-softmax.
[1] https://dl.acm.org/doi/10.1145/3604915.3608779 [2] https://arxiv.org/pdf/2103.08862.pdf [3] https://aclanthology.org/P19-2049.pdf
> From my perspective, the effectiveness of Gumbel sampling on its own is less of an issue; it has already become a standard approximation option for sampling transformations in prior works [1,2,3].
I'm not debating the effectiveness of Gumbel sampling at all. It first appeared in [1] and I found out about it a year after its publication. (Edit: I found an even earlier work from 2013[2] that mentions it frequently being used as a trick). (Edit 2: There's also Gumbel's original lecture transcripts from 1954[3]).
Assuming that you're using token equivalence as the token acceptance criterion, my goal is to relate the probability of token acceptance to the KL-divergence between the true (i.e. the distribution we would sample from with regular autoregressive sampling) and approximate (i.e. the distribution obtained from looking ahead several steps into the future) next-token distributions.
Basically, the probability distribution when we look ahead is different from the one we would obtain without lookahead sampling, and this could cause problems.
> What I think may be a bigger issue is the interaction between sampling and n-gram selection, rather than single-token selection. This introduces another potentially complicating factor. In other words, the major problem is more specific to lookahead decoding (and its n-gram design)
I agree, you'd be looking at the n-gram acceptance. But the probability of n-gram acceptance depends on the probability of single-token acceptance.
The reason why I'm focusing on single-token acceptance is because the probability of accepting an n-gram is less than the probability of accepting its first token (by the chain rule of probabilities and how each probability is between 0 and 1). If we have problems with accepting the first token when the true and lookahead distributions are only slightly different, then well, we have a problem.
> than the general effectiveness of Gumbel-softmax.
I want to repeat that I'm not criticizing its effectiveness at all. (In fact, I agree it has been standard for many years at this point...) Just trying to calculate a bound on single-token acceptance probabilities when we have constraints on the KL-divergence.
[1] https://arxiv.org/abs/1611.01144 [2] https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ [3] https://ntrl.ntis.gov/NTRL/dashboard/searchResults/titleDetail/PB175818.xhtml
> is only possible if M and L are independent,
Interestingly enough, the question of whether they are independent boils down to whether the values of $x_k$ and $y_k$ are independent.
For notational convenience, let $a_k = x_k+z_k$ and $b_k = y_k+z_k$.
Because we select $M$ through the argmax over each $a_k$, $\mathbb{P}(M=k) = \mathbb{P}(a_k\ge a_i, \forall i\neq k)$.
$\mathbb{P}(M=L)= \sum\limits_{k\in [K]} \mathbb P(M=k \wedge L=k)$
Let's examine the inner term more closely. $\mathbb P(M=k \wedge L=k) = \mathbb P(M=k | L=k)\mathbb P (L=k)$.
$\mathbb P(M=k | L=k) = \mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)$
If the values of $x_k$ and $y_k$ don't probabilistically depend on each other, then $\mathbb P(M=k | L=k)=\mathbb P(M=k)$.
In my original formulation, I didn't constrain p and q to necessarily come from the language model itself: they are just two probability vectors. So the original result, given two arbitrary (constant) probability vectors, still stands. (Recall: a probability vector is a non-negative real vector that sums to 1.)
But in real life, $\mathbf p$ and $\mathbf q$ do depend on each other.
Suppose, WLOG, that $x_k$ is the true next-token log probability and $y_k$ the approximate one. Suppose we have an autoregressive language model $f_\theta$ parameterized by weights $\theta$ that takes in tokens and returns a next-token probability vector. Suppose we have strings of tokens $\mathbf s$ and $\mathbf t$, where they share a common prefix $\mathbf t_{1:m-1}$ and $\mathbf s=\mathbf t_{1:m-1}\mathbf s_{m:\ell}$. Both $\mathbf s$ and $\mathbf t$ are of length $\ell$. Suppose that $\mathbf t$ is the target token sequence.
If we are looking at sampling the $\ell$-th token, we could set $\mathbf p=f_\theta(\cdot|\mathbf t_{1:\ell-1})$ and $\mathbf q=f_\theta(\cdot|\mathbf s_{1:\ell-1})$. But obviously, both values depend on $f_\theta(\cdot|\mathbf t_{1:m-1})$ (because the tokens are generated autoregressively, so you use the chain rule to decompose the product of probabilities to find the common multiplicative factor).
> which is unlikely since they both use the same Gumbel noise vectors.
It actually depends on the probabilistic dependence between the true and approximate logits, not on whether they use the same Gumbel noise vectors. If the values of $x_k$ didn't depend on $y_k$, then $\mathbb P(M=k | L=k)= \mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)= \mathbb P(a_k\ge a_i, \forall i\neq k)=\mathbb P(M=k)$, which would lead to the result above.
In fact, if $\mathbf p = \mathbf q$, then $\mathbb{P}(M=L) = \mathbf p\cdot\mathbf p = 1$ as expected. So no, the fact that they use the same Gumbel noise vectors is already taken into consideration.
> I also assume for the next step to work, that p and q are the softmax-transformed logits (not the raw logits).
Yes, $\mathbf p$ and $\mathbf q$ are the softmax-transformed logits (i.e. the categorical event probability vectors). But it turns out that $\mathbf x$ and $\mathbf y$ are the log probabilities, not the logits.
I noticed that in your blog, $\mathbf x$ and $\mathbf y$ are used to refer to the tokens... Sorry for any confusion there.
> Second half
>
> This part also seems problematic to me:
>
> Since $D_{KL}(P||Q)\le\epsilon$, $\sum\limits_{k\in[K]} p_k\log(\frac{p_k}{q_k})\le\epsilon$ and as $\mathbf p$ sums to 1, $\log(\frac{p_k}{q_k})\le\epsilon$ for all $k$
You're correct here. I made three mistakes in my chain of thought.
Coming up with a tight lower bound needs more work.
Also, regarding the probability of n-gram equality... Obviously, we know that the first tokens ($s_m$ and $t_m$) are identical, because the inputs to the language model are equivalent for both. So $\mathbf s_{1:m}=\mathbf t_{1:m}$. If $\ell$ is the index of the last token of the candidate n-gram, then $n=\ell-m+1$.
So $\mathbb{P}(\mathbf s=\mathbf t) = \mathbb{P}(\mathbf s_{m+1:\ell}=\mathbf t_{m+1:\ell})=\mathbb{P}(s_{\ell}=t_{\ell}\mid\mathbf s_{m+1:\ell-1}=\mathbf t_{m+1:\ell-1})\,\mathbb{P}(s_{\ell-1}=t_{\ell-1}\mid\mathbf s_{m+1:\ell-2}=\mathbf t_{m+1:\ell-2})\cdots\mathbb{P}(s_{m+1}=t_{m+1})$.
So assuming that the token acceptance criterion is token equivalence (I actually don't know how you've decided to accept tokens when sampling under the Gumbel-max trick... I'm just assuming you're checking whether the n-grams are equal instead of comparing KL-divergences, which is costly), the probability of accepting an n-gram depends on (and is upper-bounded by) the probability of accepting the first non-trivial token (i.e. the second token). In other words, $\mathbb{P}(\mathbf s=\mathbf t)\le \mathbb{P}(s_{m+1}=t_{m+1})$.
Let's call the claim that $\mathbb{P}(M=L) = \mathbf p\cdot \mathbf q$, Lemma $\star$. Because lemma $\star$ does not assume where the probability vectors come from, we can actually apply it to each of the terms in the above product (obviously, some work would be required to show that).
It turns out that for the lower bound, you can use Pinsker's inequality. Suppose $D_{KL}(P||Q) \le \epsilon$.
First, recall that for any two probability vectors, their inner product is at most 1 (by Cauchy-Schwartz, L2 norm $\le$ L1 norm, L1 norm is 1 for probability vectors) and is equal to 1 iff the two vectors are equal.
By Pinsker's inequality, $\frac12\lVert \mathbf p-\mathbf q\rVert_1^2\le \epsilon$. We rearrange this to $\lVert \mathbf p-\mathbf q\rVert_1\le \sqrt{2\epsilon}$.
$$
\begin{align}
\lVert \mathbf p\rVert_2^2-\mathbf p\cdot \mathbf q&=\mathbf p\cdot(\mathbf p-\mathbf q)\\
&\le \mathbf p\cdot|\mathbf p-\mathbf q|\\
&\le \sum_{k\in[K]}|p_k-q_k|\\
&=\lVert \mathbf p-\mathbf q\rVert_1\\
&\le \sqrt{2\epsilon}
\end{align}
$$
We rearrange this inequality to obtain: $\lVert \mathbf p\rVert_2^2-\sqrt{2\epsilon} \le\mathbf p\cdot \mathbf q=\mathbb{P}(M=L)$. By symmetry, we can have $\max(\lVert \mathbf p\rVert_2^2,\lVert \mathbf q\rVert_2^2)-\sqrt{2\epsilon} \le\mathbf p\cdot \mathbf q=\mathbb{P}(M=L)$. (Note that $\frac{1}{K}\le\lVert\mathbf p\rVert_2^2 \le 1$.)
For this inequality to give any guarantee, $D_{KL}(P||Q)$ must be below $\frac12$, and for us to guarantee at least a 50% probability of token acceptance, we need $D_{KL}(P||Q)$ to be below $\frac18$, which is a tough call.
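A quick numeric spot check of this lower bound (a sketch with random $\mathbf p$ and $\mathbf q$; the bound is vacuous when the KL is large):

```python
import torch

torch.manual_seed(1)
K = 10
p = torch.softmax(torch.randn(K), dim=0)
q = torch.softmax(torch.randn(K), dim=0)

kl = (p * (p / q).log()).sum()  # D_KL(P || Q), natural log
# Lower bound from Pinsker: max(||p||_2^2, ||q||_2^2) - sqrt(2 * KL)
bound = torch.maximum(p.pow(2).sum(), q.pow(2).sum()) - (2 * kl).sqrt()
print((p * q).sum().item(), ">=", bound.item())
```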
Edit: This is only correct up to the point where Lemma $\star$ is used (i.e. $\mathbf p\cdot \mathbf q = \mathbb{P}(M=L)$), because Lemma $\star$ only works if the Gumbel noise vectors are different iid vectors, not the same exact ones. It's valid if we're using different iid Gumbel noise vectors, though.
Perhaps I need to re-examine the claim that $\mathbb{P}(M=k)$ and $\mathbb{P}(L=k)$ are independent.
Suppose we know $\mathbf x$ and $\mathbf y$ ahead of time.
Recall: $\mathbb P(M=k | L=k) = \mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)$. $\mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)=\mathbb P(a_k\ge a_i, \forall i\neq k)$ iff they are independent.
Consider the following system of $K-1$ inequalities of the form $b_k-b_i\ge 0, \forall i\neq k$. Since $x_k$ and $y_k$ are known for all $k\in[K]$, we can attempt to derive inequalities of the form $a_k-a_i\ge \Delta_{xy}^{(i)}, \forall i\neq k$, where $\Delta_{xy}^{(i)}=x_k-y_k-x_i+y_i$. If we can confirm that $\Delta_{xy}^{(i)}\ge 0$ for all $i\neq k$, then obviously $M=k$. So in some cases, we can determine with certainty whether $M=k$. Therefore, it's not independent of $\mathbb{P}(L=k)$.
...But then you might ask: since $\mathbf x$ and $\mathbf y$ are log probabilities (and not logits in $\mathbb{R}^K$), don't $\exp(\mathbf x)$ and $\exp(\mathbf y)$ lie on the standard $(K-1)$-simplex? So how can we guarantee that it's even possible for $\Delta_{xy}^{(i)}\ge 0$ for all $i\neq k$?
I need to stop going down this rabbit hole.
It turns out that if you add the log normalization constant of the logits (i.e. a constant) to the log probability $x_k$, you get the corresponding logit. And if the normalization constant is equal to 1, then the log probabilities are the logits. I suppose for a large part of what I said earlier, we can say WLOG that $\mathbf x$ and $\mathbf y$ are actually the logits. (Especially when the quantity appears on both sides of an inequality, because then you'd just add the same constant to both sides.)
I'm going to lay off on the theoretical side until we can experimentally demonstrate that lookahead decoding with sampling actually leads to speedups.
Lemma $\star$ only works for iid Gumbel noise vectors, not the same ones. knagrecha is correct: $\mathbf p \cdot \mathbf q$ is the probability assuming that the events are independent (i.e. with different Gumbel noise vectors). Also, $\mathbf p \cdot \mathbf p < 1$ unless one of the components is equal to 1 (probability vectors have an L1 norm of 1, not an L2 norm of 1).
$\mathbf x$ and $\mathbf y$ are softmax logits, which are unconstrained in $\mathbb R$, unlike sigmoid logits, because the other components are also unconstrained in $\mathbb R$. For sigmoid logits, there is an implicit extra logit with value 0 (recall the sigmoid function is $\frac{e^x}{e^0 + e^x}$), which limits the values that $x$ can take. A softmax logit is only constrained if both the corresponding probability and all of the other logits are known.
If we are to use Weinberger's notation for the proof of the Gumbel-max trick, then
$$\mathbb P (I=\omega\cap J=\omega)=\mathbb E_M\mathbb E_N [\mathbb P(G_{\theta_k} < M~\forall k\neq\omega \cap G_{\phi_k} < N~\forall k\neq\omega )]$$
and because each Gumbel variable is independent, we can write the above probability as:
$$\mathbb P (I=\omega\cap J=\omega)=\mathbb E_M\mathbb E_N [\prod_{k\neq \omega} \mathbb P(G_{\theta_k} < M \cap G_{\phi_k} < N)]$$
We can rewrite the factor in the above equation as follows:
$$
\begin{align}
\mathbb P(G_{\theta_k} < M \cap G_{\phi_k} < N)&=\mathbb P(\theta_k + G^{(k)} < \theta_\omega + G^{(\omega)} \cap \phi_k + G^{(k)} < \phi_\omega + G^{(\omega)})\\
&=\mathbb P(\theta_k -\theta_\omega + G^{(k)} < G^{(\omega)} \cap \phi_k-\phi_\omega + G^{(k)} < G^{(\omega)})\\
&=\mathbb P(\max(\theta_k -\theta_\omega, \phi_k-\phi_\omega) + G^{(k)} < G^{(\omega)})
\end{align}
$$
So, just as $\mathbb P (I=\omega)=\pi_\omega=\frac{\exp(\theta_{\omega})}{\sum_k \exp(\theta_k)}=\frac{1}{\sum_k \exp(\theta_k-\theta_\omega)}$, we get $\mathbb P (I=\omega\cap J=\omega)=\frac{1}{\sum_k \exp(\max(\theta_k -\theta_\omega, \phi_k-\phi_\omega))}$, and $\mathbb P (I=\omega\cap J=\omega)\le \mathbb P (I=\omega)$ because $\max(\theta_k -\theta_\omega, \phi_k-\phi_\omega) \ge \theta_k -\theta_\omega~\forall k\in[K]$ by definition and $\exp$ is monotonic.
As a sanity check, this takes care of the case where $\pi$ and $\rho$ are different probability distributions, but $\pi$ is the probability distribution obtained from the same logits as $\rho$, except with a small temperature (i.e. a peakier version of the same distribution). This is because dividing by a small temperature $T$ magnifies the differences between logits, and if $\omega$ is the largest logit, then all of the theta differences will be negative, so $\max(\text{amplified negative number}, \text{negative number})=\text{negative number}$ and $\mathbb P(\text{peakier}=\omega\cap\text{regular}=\omega)=\mathbb P(\text{regular}=\omega)$. Thus, $\mathbb P(\text{peakier}=\omega\mid\text{regular}=\omega)=\mathbb P(\text{peakier}=\omega\cap\text{regular}=\omega)/\mathbb P(\text{regular}=\omega)=1$.
Also, because $\ln \pi_\omega=\theta_\omega-\mathrm{LSE}(\theta_1, \ldots, \theta_K)$, the difference in logits is equal to the difference in their corresponding log probabilities, as the LSE constant terms cancel out (this is easy to verify through examples: try it yourself!). Thus, if we know the two distributions $\pi$ and $\rho$ or their logits $\theta$ and $\phi$, we can directly compute both the joint and thus also the conditional probabilities of obtaining a particular token from the same Gumbel noise vectors.
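For instance, checking the cancellation numerically:

```python
import torch

theta = torch.randn(5)                    # arbitrary logits
log_pi = torch.log_softmax(theta, dim=0)  # corresponding log probabilities

print(theta - log_pi)                              # constant vector = LSE(theta)
print(theta[0] - theta[1], log_pi[0] - log_pi[1])  # identical pairwise differences
```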
Also, from [1], we can use stochastic beam search (i.e. use the Gumbel-max trick to sample from the distribution k times without replacement) if needed.
Thanks for the insightful discussion. I tried the new lookahead + sampling function in this PR with a very low temperature (like 0.01) to see if it converges to the greedy result. However, it doesn't give the expected output, unlike the original HF model. Maybe it's a numerical precision issue, or just an implementation bug?
To reproduce:

```bash
git clone https://github.com/knagrecha/LookaheadDecoding-sampling
cd LookaheadDecoding-sampling
git checkout abdae068f096bd608f6fbcd53a8c7c93b1125d05  # current latest commit
# run the modified minimal.py script below
python minimal.py
LOAD_LADE=1 USE_LADE=1 python minimal.py
```
The script, modified from minimal.py, simply sets `temperature` and `torch_dtype` and changes `model_name` to a newer TinyLlama. The HuggingFace baseline output with temperature = 0.01 gives exactly the same result as greedy, while the lookahead output gives quite different results every time.
I suspected it's a numerical issue because the `(logits + gumbels) / tau` calculation in `gumbel(logits, tau)` (where `tau = temperature`) reaches a very large magnitude when the temperature is close to zero. However, switching to `float32` or `float64` didn't help either.
I suggest adding a unit test to verify that the added sampling function converges to the greedy output as temperature → 0.
Good catch on the numerical issue. The denominator forces the value toward infinity the lower it gets. This might actually be a point in favor of reverting to PyTorch's gumbel_softmax (which handles this) rather than using Gumbel directly.
Good to see the speedup on sampling confirmed by someone else though (70 vs 46).
> Good to see the speedup on sampling confirmed by someone else though (70 vs 46).
Yes, I tested on RTX 4090 with torch 2.0.1
It's not a numerical issue. I just tested it. I added a comment to fix the issue (in Yard1's code review, above).
The issue is that `logits / tau + gumbels` should be returned, not `(logits + gumbels) / tau`.
```python
def gumbel(logits, tau):
    gumbels = (
        -torch.empty_like(logits, memory_format=torch.legacy_contiguous_format).exponential_().log()
    )  # ~Gumbel(0,1)
    # BUG: dividing the sum by tau rescales the noise along with the logits,
    # so the argmax is unchanged for any tau > 0.
    return (logits + gumbels) / tau  # ~Gumbel(logits,tau)
```
Using `(logits + gumbels) / tau` has the effect of sampling with temperature 1, regardless of the value of `tau` passed: dividing both the logits and the noise by a positive constant doesn't change the argmax.
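For reference, here's a minimal sketch of the function with that fix applied:

```python
import torch

def gumbel(logits, tau):
    gumbels = (
        -torch.empty_like(logits, memory_format=torch.legacy_contiguous_format)
        .exponential_()
        .log()
    )  # ~Gumbel(0,1)
    # Scale only the logits by the temperature; the noise keeps scale 1,
    # so the argmax samples from softmax(logits / tau).
    return logits / tau + gumbels
```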
`minimal.py` output for TinyLlama:
Also, here are some random information theoretic experiments that are somewhat related to the above proofs.
Empirically, it seems that if $1-\epsilon\le\mathbf p\cdot \mathbf q\le 1$, then $\frac12 \lVert \mathbf p -\mathbf q\rVert_1 \le \epsilon$ (but not the other way around, the reverse bound seems a lot looser).
In the case where $p_\omega = 1$, we have equality, i.e. $\frac12 \lVert \mathbf p -\mathbf q\rVert_1=\frac12 (p_\omega-q_\omega + \sum_{k\neq \omega}(q_k-p_k)) = \frac12 (1-q_\omega + \sum_{k\neq \omega}(q_k-0))= \frac12 (1-q_\omega + 1-q_\omega)=1-q_\omega=1-\mathbf p \cdot\mathbf q$.
The general proof is actually quite similar. Let $A\subseteq [K]$ be the subset of indices where $p_k\ge q_k$. Then
$$
\begin{align}
\frac12 \lVert \mathbf p -\mathbf q\rVert_1&=\frac12 \Big(\sum_{k\in A}(p_k-q_k)+\sum_{k\not\in A}(q_k-p_k)\Big)\\
&=\frac12 \Big(\sum_{k\in A}p_k+\sum_{k\not\in A}q_k-\sum_{k\in A}q_k-\sum_{k\not\in A}p_k\Big)\\
&\le\frac12 \Big(1+1-\sum_{k\in A}p_kq_k-\sum_{k\not\in A}p_kq_k\Big)\\
&=1-\mathbf p\cdot \mathbf q
\end{align}
$$
So if $1-\epsilon\le\mathbf p\cdot \mathbf q$ or $1-\mathbf p\cdot \mathbf q\le \epsilon$, then $\frac12 \lVert \mathbf p -\mathbf q\rVert_1 \le \epsilon$.
Similarly, suppose that the maximum difference between logits, or equivalently, the maximum difference between log probabilities is $M$. Then $\frac1M D_{KL}(P||Q)\le 1-\mathbf p\cdot \mathbf q$. For $k\in A$, $p_k\ge q_k$, so $\ln\frac{p_k}{q_k}\ge0$. (If $k\not\in A$, then it is negative).
(This proof is harder...)
Edit: Not true for some values of $\mathbf p$ and $\mathbf q$.
So, with iid Gumbel noise vectors for both distributions, you get $\mathbb P(I=J) = \mathbf p\cdot \mathbf q$.
The following are probabilities for a single Jacobi decoding step. Obviously, due to the autoregressive nature of LLMs, subsequent probabilities depend on which tokens are accepted, which causes the math to not simplify to a nice closed form when considering more than 1 step.
For the sake of completeness, I'll include the following n-gram statistics; they follow from basic probability theory (a small sketch computing the last one follows the list):

- Probability that all $n$ tokens match: $\prod\limits_{i=1}^n \mathbf p^{(i)}\cdot \mathbf q^{(i)}$
- Probability that exactly the first $n$ tokens match: $(1-\mathbf p^{(n+1)}\cdot \mathbf q^{(n+1)})\prod\limits_{i=1}^{n} \mathbf p^{(i)}\cdot \mathbf q^{(i)}$
- Expected number of matching tokens out of $N$: $\sum\limits_{n=1}^{N-1} \Big[n(1-\mathbf p^{(n+1)}\cdot \mathbf q^{(n+1)})\prod\limits_{i=1}^{n} \mathbf p^{(i)}\cdot \mathbf q^{(i)}\Big]+N\prod\limits_{i=1}^N \mathbf p^{(i)}\cdot \mathbf q^{(i)}$
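Here's a small sketch computing that last statistic, assuming the per-step match probabilities $\mathbf p^{(i)}\cdot \mathbf q^{(i)}$ are given (`expected_accepted` is just an illustrative name):

```python
import torch

def expected_accepted(match_probs: torch.Tensor) -> torch.Tensor:
    # match_probs[i] = p^(i+1) · q^(i+1): probability the (i+1)-th token matches.
    N = match_probs.numel()
    prefix = torch.cumprod(match_probs, dim=0)  # P(first n tokens all match)
    expected = N * prefix[-1]                   # all N tokens accepted
    for n in range(1, N):
        # exactly n tokens accepted: first n match, token n+1 doesn't
        expected = expected + n * (1 - match_probs[n]) * prefix[n - 1]
    return expected

print(expected_accepted(torch.tensor([0.9, 0.8, 0.7])))  # ≈ 2.124
```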
For the same $K$ iid Gumbel noise vectors for both distributions (e.g. because we are evaluating what would have happened if we had chosen to sample from $\mathbf p$ vs. $\mathbf q$ when using a pseudo-random number generator (PRNG)):
$P(I=J) = \sum_k \frac1{\sum_j \exp(\max(\theta_j-\theta_k, \phi_j-\phi_k))}= \sum_k \frac1{\sum_j \exp(\max(\ln p_j-\ln p_k, \ln q_j-\ln q_k))}$
Since $\exp$ is monotonic, $\sum_k \frac1{\sum_j \exp(\max(\ln p_j-\ln p_k, \ln q_j-\ln q_k))}=\sum_k \frac1{\sum_j \max( p_j/p_k, q_j/q_k)}$
The equivalent three n-gram statistics for this case are a trivial extension of the case where you have iid Gumbel noise vectors and is left as an exercise for the reader. 🤗
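And a quick Monte-Carlo check of the paired (shared-noise) closed form above, using random $\mathbf p$ and $\mathbf q$:

```python
import torch

torch.manual_seed(0)
K, trials = 5, 200_000
p = torch.softmax(torch.randn(K), dim=0)
q = torch.softmax(torch.randn(K), dim=0)

# One shared Gumbel(0,1) vector per trial, reused for both distributions.
g = -(-torch.rand(trials, K).log()).log()
i_tok = (p.log() + g).argmax(dim=1)
j_tok = (q.log() + g).argmax(dim=1)
print((i_tok == j_tok).float().mean().item())  # empirical P(I = J)

# Closed form: sum_k 1 / sum_j max(p_j / p_k, q_j / q_k)
ratios = torch.maximum(p[None, :] / p[:, None], q[None, :] / q[:, None])
print((1.0 / ratios.sum(dim=1)).sum().item())
```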
Now, suppose we want to validate if an $N$-gram is a valid completion. Now, suppose we have a sequence of $N$ Gumbel noise vectors, each of dimension $K$. Then we'll accept a completion sampled from an approximate distribution $Q$ if it is the same completion when sampled (using Gumbel sampling) from the true distribution $P$. Thus, we've reduced the problem of checking if the distributions are similar enough to checking if a sample from the approximate distribution is the same as a sample from the true distribution (I thought this is what we were doing all along, but I realized that the code doesn't actually implement this).
This means that we'll need to store the Gumbel values in-memory until we've verified that the token at a particular time step is accurate. Moreover, instead of using different Gumbel noise vectors each time (as the current implementation does), we'll need to use the same Gumbel noise vectors at each time step.
Obviously, this means that we might unnecessarily discard valid generations if the entropy of the next token distribution is high (i.e. close to being uniformly random). But this way, we have stronger theoretical guarantees for verification.
@knagrecha This idea isn't dead, right? I noticed that you haven't incorporated Yard1's suggestions into the code from a month ago.
Not dead, just my lack of skill with GitHub 😅. I hit the “commit” button on the comment and thought it would integrate.
That should do it for Yard1’s suggestion. I should stop handling my GitHub threads & commits on Chrome mobile…leads to accidental uncommitted changes like this.
Happy new year! 🎆
Also, $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ is a weird function. 1) What does it look like? And 2) how much better is it to use the same Gumbel noise vectors for each time step?
Remark: the paired next token equality probability $\sum_k\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ kind of looks like a modified version of the harmonic mean... I also need a better name for the event of "next token equality", because that is a mouthful to say.
For now, let's examine the probability for the $k$'th token only.
We previously (https://github.com/hao-ai-lab/LookaheadDecoding/pull/6#issuecomment-1866930575) established that $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)} \le p_k$ and $\le q_k$ (except when $p_k$ or $q_k$ = 0, but we'll just define it to be 0 for now), so it is $\le \min(p_k,q_k)$. Thus, it is also $\le 1$ by transitivity. Because probability vectors are non-negative, $0\le \frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$, but can we give a better lower bound?
Minimizing $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ is the same thing as maximizing its denominator, which is equivalent to $1 + \sum_{j\neq k} \max(p_j/p_k,q_j/q_k)$. Because the L1 norms of probability vectors are equal to 1, to maximize the contribution of each sum, the supports of $\mathbf p$ and $\mathbf q$ over $[K]\setminus\{k\}$ must be disjoint.
In other words, $1 + \sum_{j\neq k} \max(p_j/p_k,q_j/q_k) \le 1 + \frac1{p_k}(1-p_k)+\frac1{q_k}(1-q_k)= \frac1{p_k}+\frac1{q_k} - 1$ (with equality when the supports of $\mathbf p$ and $\mathbf q$ over $[K]\setminus\{k\}$ are disjoint).
So, $\frac1{\frac1{p_k}+\frac1{q_k} - 1}\le \frac1{\sum_j \max(p_j/p_k,q_j/q_k)} \le \min(p_k,q_k)$ for when $p_k$ and $q_k$ are in $(0,1]$.
Thus, $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ is a smooth approximator of $\min(p_k,q_k)$ with equality at the boundary conditions where at least one of $p_k$ or $q_k$ lie in $\{0,1\}$ (For the lower bound, if either $p_k$ or $q_k$ is equal to 0, we take the limit, but if they are both 0, then we set it to 0 for convenience (as the limit DNE). We then use the Squeeze Theorem.).
As it turns out, the equality probability under unpaired Gumbel sampling is also a smooth approximator of $\min(p_k,q_k)$ with equality at the boundary square formed when $|\{p_k,q_k\}\cap\{0,1\}|>0$.
So how much better is paired sampling than unpaired sampling, really?
So, how much better is paired sampling than unpaired sampling (which has probability $p_k q_k$)? We can take the ratio of probabilities to find out.
Since $\min(p_k,q_k)\max(p_k,q_k)=p_kq_k$, $\frac{p_kq_k}{\min(p_k,q_k)}=\max(p_k,q_k)$.
$p_kq_k (\frac1{p_k}+\frac1{q_k} - 1)=p_k+q_k-p_kq_k$.
So, the ratio of probabilities between using unpaired vs paired Gumbel sampling is roughly equal to $\max(p_k,q_k)$.
But what about the ratio when $p_k=q_k$? Then the ratio curve isn't linear for the lower bound ratio as it is for the upper bound ratio, but quadratic instead.
So it seems like even for unpaired sampling, the approximation tends to be surprisingly good, even if not perfect.
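A few spot checks of these bounds and the paired/unpaired gap:

```python
# lower bound 1/(1/p + 1/q - 1) <= paired prob <= min(p, q); unpaired prob = p * q
for pk, qk in [(0.9, 0.8), (0.5, 0.5), (0.1, 0.05)]:
    lower = 1 / (1 / pk + 1 / qk - 1)
    print(f"p={pk}, q={qk}: lower={lower:.3f}, min={min(pk, qk):.3f}, unpaired={pk * qk:.3f}")
```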
So in summary, there are 2 advantages to using the same Gumbel noise vectors at each time step: the single-token acceptance probability rises from roughly $p_kq_k$ to roughly $\min(p_k,q_k)$, and we get the stronger theoretical guarantees for verification discussed above.
Next steps: Show that there is an empirical speed boost when using the same Gumbel noise vectors at each time step.
(Also, do human pairwise preferences between approximate and true completions indicate that true completions are better than approximate ones, even at higher temperatures?)
I made the following changes to this PR:
Sampling is now supported via rejection sampling, as in the prior work SpecInfer. Gumbel can be a good method for in-place sampling. Closing this thread now; we may reopen one for in-place Gumbel support.
Quick test of Gumbel softmax sampling instead of greedy decoding. Need to add args to make it configurable (e.g., for temperature), but this PR allows us to quickly see if it works as expected.