hao-ai-lab / LookaheadDecoding

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
https://arxiv.org/abs/2402.02057
Apache License 2.0
1.11k stars 66 forks

Gumbel noise for sampling in-place of greedy decoding #6

Closed knagrecha closed 4 months ago

knagrecha commented 10 months ago

Quick test of Gumbel softmax sampling instead of greedy decoding. Need to add args to make it configurable, e.g., for temperature, but this PR allows us to quickly see whether it works as expected.

knagrecha commented 10 months ago

All credit to Yao Fu (@FranxYao) for the idea.

lxuechen commented 10 months ago

Just saw this, as I thought of the exact idea yesterday. Minor: the softmax operation at the end isn't required -- just the max of the Gumbel perturbed logits.

knagrecha commented 10 months ago

Just saw this, as I thought of the exact idea yesterday. Minor: the softmax operation at the end isn't required -- just the max of the Gumbel perturbed logits.

Good point, and yes, we should change this to just the perturbation. I used the Gumbel softmax for a quick test since it was readily available as a pre-defined op, but the softmax is just pointless overhead. In addition, we should skip this altogether if temperature is set to 0 and just fall back to regular argmax.
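
For concreteness, a minimal sketch of what the perturbation-only path could look like (function name and interface are illustrative, not this PR's actual code):

```python
import torch

def sample_with_gumbel(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Pick the next token via the Gumbel-max trick; fall back to plain argmax at temperature 0."""
    if temperature == 0.0:
        return logits.argmax(dim=-1)
    # Gumbel(0, 1) noise: the negative log of an Exponential(1) sample is standard Gumbel.
    gumbels = -torch.empty_like(logits).exponential_().log()
    # Perturb the temperature-scaled logits and take the argmax; no softmax is needed.
    return (logits / temperature + gumbels).argmax(dim=-1)
```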

knagrecha commented 10 months ago

Users can now configure the temperature in the generate function. @Viol2000 let me know what you think. I also updated minimal.py to add an example of sampling. From my quick test it seems to maintain the speedup in the sampling setting too: about 1.46x higher throughput with lade on vs. off.

Viol2000 commented 10 months ago

Hi @knagrecha , thanks for all your efforts in implementing the sampling method!

From my point of view, Gumbel is a brilliant method for sampling, but I wonder whether it can preserve the output sampling distribution so that it matches what we would get without lookahead decoding. I think we need to do some rejection sampling in the verification branch to keep the output distribution, as in speculative decoding. Sampling in the lookahead branch (i.e., the variable past_tokens) does not seem sufficient to maintain the output distribution.

What's your opinion? Let's discuss!

knagrecha commented 10 months ago

Hi @Viol2000, thanks for the quick response. Let me get back to you after I think about a proof of equivalency between the two methods (or, if such a proof is not possible, a modification of the implementation that would allow one). I agree that from a research standpoint, maintaining the distribution is important; otherwise we change the semantics of "sampling". Could you share any formalization you have demonstrating that lookahead decoding with n-grams is equivalent in the greedy decoding case? This may help as a starting point.

From a product/OSS tool perspective, I think it is important to have some kind of sampling support in place for lookahead decoding adoption. Anecdotally, most of the real-world LLM deployments I have seen prefer sampling over greedy decoding. I wonder if we can merge this as temporary support for sampling, and note in the repo that "we hope to verify and support mathematically equivalent sampling soon".

I imagine that for a large number of users, the actual functionality of temperature-modified sampling matters more than exactly preserving the distribution. But I also see a risk of confusing users if the change is not surfaced properly. What are your thoughts?

shermansiu commented 10 months ago

@knagrecha, I saw this an hour and a half ago when you pinged the issue on huggingface/transformers.

As mentioned in https://github.com/vllm-project/vllm/issues/1742, I'm still concerned about experimentally verifying that the approximate distributions are of high enough quality (i.e. the KL-divergences between the true and approximate conditional next token distributions are low enough).

I spent the past bit mathing it out:

For the sake of brevity, let $[K]$ represent the integer interval $\{1, \ldots, K\}$.

Suppose $P\sim Cat(K, \mathbf{p}), Q\sim Cat(K, \mathbf{q})$. Let $\mathbf x$ and $\mathbf y$ be the logits of $\mathbf p$ and $\mathbf q$, respectively.

Fix the Gumbel seed $z_k\overset{i.i.d.}{\sim} Gumbel(0,1)$, $k\in [K]$. Let $M = \mathrm{argmax}_{k\in [K]}(x_k+z_k)$ and $L = \mathrm{argmax}_{k\in [K]}(y_k+z_k)$.

  1. What is $\mathbb P(M=L)$? Let $\sigma(\cdot)$ denote the softmax function.

$$ \begin{align} \mathbb P(M=L)&=\sum_{k\in [K]}^K \mathbb P(M=k) \mathbb P(L=k)\\ &=\sum_{k\in [K]}^K\sigma_k(\mathbf x) \sigma_k(\mathbf y)\\ &=\mathbf p \cdot \mathbf q \end{align} $$

  2. Suppose $D_{KL}(P||Q)\le \epsilon$. (Recall that $D_{KL}(P||Q)\ge 0$ by Jensen's inequality.) What is a corresponding lower bound for $\mathbb P(M=L)$? Since $D_{KL}(P||Q)\le \epsilon$, $\sum\limits_{k\in [K]}^K p_k\log(\frac{p_k}{q_k}) \le \epsilon$ and as $\mathbf p$ sums to 1, $\log(\frac{p_k}{q_k})\le \epsilon$ for all $k$. Rearranging, we get $\frac{p_k}{e^\epsilon}\le q_k$. We can now derive the following inequality:

$$ \begin{align} \mathbf p \cdot \mathbf q\ge\mathbf p \cdot (\frac{1}{e^\epsilon}\mathbf p)\ge\frac{1}{e^\epsilon} \end{align} $$

But since $D(P||Q)=\frac{1}{\log 2}D_{KL}(P||Q)\ge\frac{1}{2\log 2}\lVert\mathbf p-\mathbf q\rVert_1^2$[^1], $\epsilon$ is also lower-bounded accordingly.

I'm not sure if the notation is the best, but I think I got the point across.

[^1]: Also known as Pinsker's inequality. Cover and Thomas (1991, Lemma 12.6.1, pp. 300–301). Also see Lemma 361 in Shalizi (2006).

shermansiu commented 10 months ago

... And I just realized I spent an hour typing it out. 🤣

shermansiu commented 10 months ago

For sampling, token equivalence is not necessary for acceptance if the KL-divergence between the true and approximate distributions is small (but it certainly is a lot faster to compute!). I just hope that for high entropy distributions (where the probabilities are closer together), the token acceptance rate won't drop too much! (Obviously, the empirical frequency of encountering high entropy distributions needs to be considered, especially at common sampling temperatures.)

You could say I'm a bit concerned about the tradeoffs between text quality and speed.

But then again, what matters most is whether the text generated from approximate sampling is as preferable as that from true sampling, as judged by a human reader.

shermansiu commented 10 months ago

Also, regarding rejection sampling... Hopefully there can be some nice concentration equalities showing that the approximate distribution is "good enough" if you have enough samples?

shermansiu commented 10 months ago

In general, I'm skeptical whether using lookahead decoding for sampling would be effective.

Using a diffusion language model with adversarial diffusion distillation (to reduce the number of decoding steps to 1) might work, but is still largely unexplored territory.

knagrecha commented 10 months ago

Hi @shermansiu, thanks for all the great efforts in putting together this mathematical formalization. However, I have some doubts on a few of the steps.

$$ \begin{align} \mathbb P(M=L)&=\sum_{k\in [K]}^K \mathbb P(M=k) \mathbb P(L=k) \end{align} $$

In this line, you express the total probability that the argmax of both distributions is equal to the same value. However, this breakdown of the probability is only possible if M and L are independent, which is unlikely since they both use the same Gumbel noise vectors.

I also assume for the next step to work, that p and q are the softmax-transformed logits (not the raw logits).

Second half

This part also seems problematic to me:

Since $D_{KL}(P||Q)\le \epsilon$, $\sum\limits_{k\in [K]}^K p_k\log(\frac{p_k}{q_k}) \le \epsilon$ and as $\mathbf p$ sums to 1, $\log(\frac{p_k}{q_k})\le \epsilon$ for all $k$

Let me give a very simplistic counterexample where this inference about the summation does not hold. Ignore whether the values are realistic; this is just a demonstration.

$$p_0 = 0.9, p_1 = 0.1, q_0 = 0.8, q_1 = 0.2.$$ $$\epsilon=0.1$$

$$D_{KL}(p||q) = 0.9 \log(\frac{0.9}{0.8}) + 0.1 \log(\frac{0.1}{0.2}) \approx 0.0367$$

$$\log(\frac{p_0}{q_0})=\log(\frac{0.9}{0.8}) \approx 0.1178$$

$$\log(\frac{p_1}{q_1}) = \log(\frac{0.1}{0.2}) \approx -0.693$$

Notice that $0.11768 > \epsilon$, thus showing that this summation property does not hold.

These counterexamples do not prove any equivalence of course...only that the final inequality & lower bound derived may not be correct.

knagrecha commented 10 months ago

From my perspective, the effectiveness of Gumbel sampling on its own is less of an issue; it has already become a standard approximation option for sampling transformations in prior works [1,2,3]. The point of equivalence between temperature-configured softmax sampling and Gumbel-softmax is important, but at least we can draw on prior examples there.

What I think may be a bigger issue is the interaction between sampling and n-gram selection, rather than single-token selection. This introduces another potentially complicating factor. In other words, the major problem is more specific to lookahead decoding (and its n-gram design) than the general effectiveness of Gumbel-softmax.

[1] https://dl.acm.org/doi/10.1145/3604915.3608779 [2] https://arxiv.org/pdf/2103.08862.pdf [3] https://aclanthology.org/P19-2049.pdf

shermansiu commented 10 months ago

From my perspective, the effectiveness of Gumbel sampling on its own is less of an issue; it has already become a standard approximation option for sampling transformations in prior works [1,2,3].

I'm not debating the effectiveness of Gumbel sampling at all. It first appeared in [1] and I found out about it a year after its publication. (Edit: I found an even earlier work from 2013[2] that mentions it frequently being used as a trick). (Edit 2: There's also Gumbel's original lecture transcripts from 1954[3]).

Assuming that you're using token equivalence as the token acceptance criterion, my goal is to relate the probability of token acceptance to the KL-divergence between the true next-token distribution (i.e., the distribution we would sample from with regular auto-regressive decoding) and the approximate one (i.e., the distribution obtained from looking ahead several steps into the future).

Basically, the probability distribution when we look ahead differs from the one we would obtain without lookahead sampling, and this could cause problems.

What I think may be a bigger issue is the interaction between sampling and n-gram selection, rather than single-token selection. This introduces another potentially complicating factor. In other words, the major problem is more specific to lookahead decoding (and its n-gram design)

I agree, you'd be looking at the n-gram acceptance. But the probability of n-gram acceptance depends on the probability of single-token acceptance.

The reason why I'm focusing on single-token acceptance is because the probability of accepting an n-gram is less than the probability of accepting its first token (by the chain rule of probabilities and how each probability is between 0 and 1). If we have problems with accepting the first token when the true and lookahead distributions are only slightly different, then well, we have a problem.

than the general effectiveness of Gumbel-softmax.

I want to repeat that I'm not criticizing its effectiveness at all. (In fact, I agree it has been standard for many years at this point...) Just trying to calculate a bound on single-token acceptance probabilities when we have constraints on the KL-divergence.


[1] https://arxiv.org/abs/1611.01144 [2] https://lips.cs.princeton.edu/the-gumbel-max-trick-for-discrete-distributions/ [3] https://ntrl.ntis.gov/NTRL/dashboard/searchResults/titleDetail/PB175818.xhtml

shermansiu commented 10 months ago

is only possible if M and L are independent,

Interestingly enough, the question of whether they are independent boils down to whether the values of $x_k$ and $y_k$ are independent.

For notational convenience, let $a_k = x_k+z_k$ and $b_k = y_k+z_k$.

Because we select $M$ through the argmax over each $a_k$, $\mathbb{P}(M=k) = \mathbb{P}(a_k\ge a_i, \forall i\neq k)$.

$\mathbb{P}(M=L)= \sum\limits_{k\in [K]}^K \mathbb P(M=k \wedge L=k)$

Let's examine the inner term more closely. $\mathbb P(M=k \wedge L=k) = \mathbb P(M=k | L=k)\mathbb P (L=k)$.

$\mathbb P(M=k | L=k) = \mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)$

If the values of $x_k$ and $y_k$ don't probabilistically depend on each other, then $\mathbb P(M=k | L=k)=\mathbb P(M=k)$.

In my original formulation, I didn't constrain p and q to necessarily come from the language model itself: they are just two probability vectors. So the original result, given two arbitrary (constant) probability vectors, still stands. (Recall: a probability vector is a non-negative real vector that sums to 1.)

But in real life, $\mathbf p$ and $\mathbf q$ do depend on each other.

Suppose, WLOG, that $x_k$ is the true next token probability and $y_k$ is the approximate one. Suppose we have an autoregressive language model $f_\theta$ parameterized by weights $\theta$ that takes in tokens and returns a next token probability vector. Suppose we have strings of tokens $\mathbf s$ and $\mathbf t$, where they share a common prefix $\mathbf t_{1:m-1}$ and $\mathbf s=\mathbf t_{1:m-1}\mathbf s_{m:\ell}$. Both $\mathbf s$ and $\mathbf t$ are of length $\ell$. Suppose that $\mathbf t$ is the target token sequence.

If we are looking at sampling the $\ell$-th token, we could set $\mathbf p=f_\theta(\cdot|\mathbf t_{1:\ell-1})$ and $\mathbf q=f_\theta(\cdot|\mathbf s_{1:\ell-1})$. But obviously, both values depend on $f_\theta(\cdot|\mathbf t_{1:m-1})$ (because the tokens are generated autoregressively, so you use the chain rule to decompose the product of probabilities to find the common multiplicative factor).

which is unlikely since they both use the same Gumbel noise vectors.

It actually depends on the probabilistic dependence between the true and approximate logits, not on whether they use the same Gumbel noise vectors. If the values of $x_k$ didn't depend on $y_k$, then $\mathbb P(M=k | L=k)= \mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)= \mathbb P(a_k\ge a_i, \forall i\neq k)=\mathbb P(M=k)$, which would lead to the result above.

In fact, if $\mathbf p = \mathbf q$, then $\mathbf{P}(M=L) = \mathbf p\cdot\mathbf p = 1$ as expected. So no, the fact that they use the same Gumbel noise vectors is already taken into consideration.

I also assume for the next step to work, that p and q are the softmax-transformed logits (not the raw logits).

Yes, $\mathbf p$ and $\mathbf q$ are the softmax-transformed logits (i.e. the categorical event probability vectors). But it turns out that $\mathbf x$ and $\mathbf y$ are the log probabilities, not the logits.

I noticed that in your blog, $\mathbf x$ and $\mathbf y$ are used to refer to the tokens... Sorry for any confusion there.

Second half

This part also seems problematic to me:

Since $D_{KL}(P||Q)\le\epsilon$, $\sum\limits_{k\in[K]}^K p_k\log(\frac{p_k}{q_k})\le\epsilon$ and as $\mathbf p$ sums to 1, $\log(\frac{p_k}{q_k})\le\epsilon$ for all $k$

You're correct here. I made three mistakes in my chain of thought here:

  1. It should be $\sum\limits_{k\in[K]}^K p_k\log(\frac{p_k}{q_k})\le\sum\limits_{k\in[K]}^K \log(\frac{p_k}{q_k})$ (Edit: Also wrong. It's $0\le p_k\log(\frac{p_k}{q_k})\le \log(\frac{p_k}{q_k})$ if the log is non-negative. Otherwise, $\log(\frac{p_k}{q_k})\le p_k\log(\frac{p_k}{q_k}) < 0$. Inequalities that involve negative numbers must be handled carefully.)
  2. Logs are not non-negative, so $\sum\limits_{k\in[K]}^K \log⁡(\frac{p_k}{q_k})\le \epsilon$ does not imply that they are all $\le \epsilon$ (because if you have both negative and positive terms, their sum can "cancel out" the magnitude of the largest term).
  3. The inequality is in the other direction. i.e. I want something $\le \sum\limits_{k\in[K]}^K p_k\log⁡(\frac{p_k}{q_k})$, rather than the other way around.

Coming up with a tight lower bound needs more work.

shermansiu commented 10 months ago

Also, regarding the probability of n-gram equality... Obviously, we know that the first tokens ($s_m$ and $t_m$) are identical, because the inputs to the language model are equivalent for both. So $\mathbf s_{1:m}=\mathbf t_{1:m}$. If $\ell$ is the index of the last token of the candidate n-gram, then $n=\ell-m+1$.

So $\mathbb{P}(\mathbf s=\mathbf t) = \mathbb{P}(\mathbf s_{m+1:\ell}=\mathbf t_{m+1:\ell})=\mathbb{P}(s_{\ell}=t_{\ell}|\mathbf s_{m+1:\ell-1}=\mathbf t_{m+1:\ell-1})\,\mathbb{P}(s_{\ell-1}=t_{\ell-1}|\mathbf s_{m+1:\ell-2}=\mathbf t_{m+1:\ell-2})\cdots\mathbb{P}(s_{m+1}=t_{m+1})$.

So assuming that the token acceptance criterion is token equivalence (I actually don't know how you've decided to accept tokens when sampling under the Gumbel-max trick... I'm just assuming you're checking if the n-grams are equal instead of comparing KL-divergences, which is costly), the probability of accepting an n-gram depends on (and is upper-bounded by) the probability of accepting the first non-trivial token (i.e. the second token). In other words, $\mathbb{P}(\mathbf s=\mathbf t)\le \mathbb{P}(s_{m+1}=t_{m+1})$.

Let's call the claim that $\mathbb{P}(M=L) = \mathbf p\cdot \mathbf q$, Lemma $\star$. Because lemma $\star$ does not assume where the probability vectors come from, we can actually apply it to each of the terms in the above product (obviously, some work would be required to show that).

shermansiu commented 10 months ago

It turns out that for the lower bound, you can use Pinsker's inequality. Suppose $D_{KL}(P||Q) \le \epsilon$.

First, recall that for any two probability vectors, their inner product is at most 1 (by Cauchy-Schwarz, L2 norm $\le$ L1 norm, L1 norm is 1 for probability vectors) and is equal to 1 iff the two vectors are equal.

By Pinsker's inequality, $\frac12\lVert \mathbf p-\mathbf q\rVert_1^2\le \epsilon$. We rearrange this to $\lVert \mathbf p-\mathbf q\rVert_1\le \sqrt{2\epsilon}$.

$$ \begin{align} \lVert \mathbf p\rVert_2^2-\mathbf p\cdot \mathbf q&=\mathbf p\cdot(\mathbf p-\mathbf q)\\ &\le \mathbf p\cdot|\mathbf p-\mathbf q|\\ &\le \sum_{k\in[K]}^K|p_k-q_k|\\ &=\lVert \mathbf p-\mathbf q\rVert_1\\ &\le \sqrt{2\epsilon} \end{align} $$

We rearrange this inequality to obtain: $\lVert \mathbf p\rVert_2^2-\sqrt{2\epsilon} \le\mathbf p\cdot \mathbf q=\mathbb{P}(M=L)$. Due to symmetry, we can have $\max(\lVert \mathbf p\rVert_2^2,\lVert \mathbf q\rVert_2^2)-\sqrt{2\epsilon} \le\mathbf p\cdot \mathbf q=\mathbb{P}(M=L)$. (Note that $\frac{1}{n}\le\lVert\mathbf p\rVert_2^2 \le 1$).
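
As a quick numerical sanity check of this bound (my own sketch, using random distributions rather than anything model-specific):

```python
import numpy as np
from scipy.special import rel_entr

rng = np.random.default_rng(0)
for _ in range(1000):
    p = rng.dirichlet(np.ones(50))
    q = rng.dirichlet(np.ones(50))
    eps = rel_entr(p, q).sum()                    # D_KL(P || Q)
    lower = max(p @ p, q @ q) - np.sqrt(2 * eps)  # the bound derived above
    assert lower <= p @ q + 1e-12                 # p.q = P(M = L) under the lemma
```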

For this inequality to have any guarantees, $D_{KL}(P||Q)$ must be below $\frac12$, and for us to guarantee at least a 50% probability of token acceptance, we need $D_{KL}(P||Q)$ to be below $\frac18$, which is a tough call.

Edit: This is only correct up to the point where Lemma $\star$ is used (i.e. $\mathbf p\cdot \mathbf q = \mathbb{P}(M=L)$) because Lemma $\star$ only works if the Gumbel noise vectors are different iid vectors, not the same exact ones. It's valid if we're using different iid Gumbel noise vectors though.

shermansiu commented 10 months ago

Perhaps I need to re-examine the claim that $\mathbb{P}(M=k)$ and $\mathbb{P}(L=k)$ are independent.

Suppose we know $\mathbf x$ and $\mathbf y$ ahead of time.

Recall: $\mathbb P(M=k | L=k) = \mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)$. $\mathbb P(a_k\ge a_i, \forall i\neq k | b_k\ge b_i, \forall i\neq k)=\mathbb P(a_k\ge a_i, \forall i\neq k)$ iff they are independent.

Consider the following system of $K-1$ inequalities of the form $b_k-b_i\ge 0, \forall i\neq k$. Since $x_k$ and $y_k$ are known for all $k\in[K]$, we can attempt to derive inequalities of the form $a_k-a_i\ge \Delta_{xy}^{(i)}, \forall i\neq k$, where $\Delta_{xy}^{(i)}=x_k-y_k-x_i+y_i$. If we can confirm that $\Delta_{xy}^{(i)}\ge 0$ for all $i\neq k$, then obviously $M=k$. So in some cases, we can determine with certainty whether $M=k$. Therefore, it's not independent from $\mathbb{P}(L=k)$.


...But then you might ask, since $\mathbf x$ and $\mathbf y$ are log probabilities (and not logits in $\mathbb{R}^K$), don't $\exp(\mathbf x)$ and $\exp(\mathbf y)$ lie on the standard $K-1$-simplex? So how can we guarantee that it's even possible for $\Delta_{xy}^{(i)}\ge 0$ for all $i\neq k$?

I need to stop going down this rabbit hole.


It turns out that if you add $\log \sum_{k} e^{p_k}$ (i.e. a constant) to $x_k$, you get the logits. And if the normalization constant is equal to 1, then the log probabilities are the logits. I suppose for a large part of what I said earlier, we can say WLOG, that $\mathbf x$ and $\mathbf y$ are actually the logits. (Especially when you have an inequality where it appears on both sides of the inequality, because then you'd just need to add a constant term to both sides).


I'm going to lay off on the theoretical side until we can experimentally demonstrate that lookahead decoding with sampling actually leads to speedups.

shermansiu commented 9 months ago

Lemma $\star$ only works for iid Gumbel noise vectors, not the same ones. @knagrecha is correct: $\mathbf p \cdot \mathbf q$ is the probability assuming that the events are independent (i.e. with different Gumbel noise vectors). Also, $\mathbf p \cdot \mathbf p < 1$ unless one of the components is equal to 1 (probability vectors have an L1 norm of 1, not an L2 norm of 1).

$\mathbf x$ and $\mathbf y$ are softmax logits, which are unconstrained in $\mathbb R$, unlike sigmoid logits, because the other components are also unconstrained in $\mathbb R$. For sigmoid logits, there is an implicit extra logit with value 0 (recall the sigmoid function is $\frac{e^x}{e^0 + e^x}$), which limits the values that $x$ can take. A softmax logit is only constrained if both the corresponding probability and all of the other logits are known.

If we are to use Weinberger's notation for the proof of the Gumbel-max trick, then

$$ \mathbb P (I=\omega\cap J=\omega)=\mathbb E_M\mathbb E_N [\mathbb P(G_{\theta_k} < M~\forall k\neq\omega \cap G_{\phi_k} < N~\forall k\neq\omega )] $$

and because each Gumbel variable is independent, we can write the above probability as:

$$ \mathbb P (I=\omega\cap J=\omega)=\mathbb E_M\mathbb E_N [\prod_{k\neq \omega} \mathbb P(G_{\theta_k} < M \cap G_{\phi_k} < N)] $$

We can rewrite the factor in the above equation as follows:

$$ \begin{align} \mathbb P(G_{\theta_k} < M \cap G_{\phi_k} < N)&=\mathbb P(\theta_k + G^{(k)} < \theta_\omega + G^{(\omega)} \cap \phi_k + G^{(k)} < \phi_\omega + G^{(\omega)})\\ &=\mathbb P(\theta_k -\theta_\omega + G^{(k)} < G^{(\omega)} \cap \phi_k-\phi_\omega + G^{(k)} < G^{(\omega)})\\ &=\mathbb P(\max(\theta_k -\theta_\omega, \phi_k-\phi_\omega) + G^{(k)} < G^{(\omega)}) \end{align} $$

So, just as $\mathbb P (I=\omega)=\pi_\omega=\frac{\exp(\theta_{\omega})}{\sum_k \exp(\theta_k)}=\frac{1}{\sum_k \exp(\theta_k-\theta_\omega)}$, we have $\mathbb P (I=\omega\cap J=\omega)=\frac{1}{\sum_k \exp(\max(\theta_k -\theta_\omega, \phi_k-\phi_\omega))}$, and $\mathbb P (I=\omega\cap J=\omega)\le \mathbb P (I=\omega)$ because $\max(\theta_k -\theta_\omega, \phi_k-\phi_\omega) \ge \theta_k -\theta_\omega~\forall k\in[K]$ by definition and $\exp$ is monotonic.

As a sanity check, this takes care of the case where $\pi$ and $\rho$ are different probability distributions, but $\pi$ is the probability distribution obtained from the same logits as $\rho$, except with a small temperature (i.e., a peakier version of the same distribution). This is because dividing by a small temperature $T$ will magnify the differences between logits, and if omega is the largest logit, then all of the theta differences will be negative, so max(amplified negative number, negative number) = negative number and probability(peakier=omega & regular=omega) = probability(regular=omega). Thus, probability(peakier=omega | regular=omega) = probability(peakier=omega & regular=omega) / probability(regular=omega) = 1.
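
If it helps, here is a small simulation sketch (my own check, not code from this repo) comparing the closed form above against Monte Carlo draws that share one Gumbel noise vector per trial:

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 6, 200_000
theta = rng.normal(size=K)                      # logits of pi
phi = rng.normal(size=K)                        # logits of rho
pi = np.exp(theta - theta.max())
pi /= pi.sum()

# Closed form: P(I = w, J = w) = 1 / sum_k exp(max(theta_k - theta_w, phi_k - phi_w))
joint = np.array([1.0 / np.exp(np.maximum(theta - theta[w], phi - phi[w])).sum() for w in range(K)])
assert np.all(joint <= pi + 1e-12)              # P(I = w, J = w) <= P(I = w)

# Monte Carlo: the same Gumbel noise vector is used for both argmaxes in each trial
g = rng.gumbel(size=(trials, K))
I = np.argmax(theta + g, axis=1)
J = np.argmax(phi + g, axis=1)
mc_joint = np.array([np.mean((I == w) & (J == w)) for w in range(K)])

print(np.round(joint, 4))                       # should closely match mc_joint
print(np.round(mc_joint, 4))
```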

shermansiu commented 9 months ago

Also, because $\ln \pi_\omega=\theta_\omega-\mathrm{LSE}(\theta_1, \ldots, \theta_K)$, the difference in logits is equal to the difference in their corresponding log probabilities, as the LSE constant terms will cancel out (this is easy to verify through examples: try it yourself!). Thus, if we know the two distributions $\pi$ and $\rho$ or their logits $\theta$ and $\phi$, we can directly compute both the joint and thus also the conditional probabilities of obtaining a particular token from the same Gumbel noise vectors.
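
For example, a small sketch of that direct computation, working from the two probability vectors alone (a hypothetical helper, not part of this PR):

```python
import numpy as np

def paired_token_match_probs(p: np.ndarray, q: np.ndarray):
    """Joint and conditional probabilities of picking the same token under shared Gumbel noise.

    Uses log-probability differences in place of logit differences, since the LSE constants cancel.
    """
    log_p, log_q = np.log(p), np.log(q)
    # denom[w] = sum_k exp(max(log p_k - log p_w, log q_k - log q_w))
    denom = np.exp(np.maximum(log_p[None, :] - log_p[:, None],
                              log_q[None, :] - log_q[:, None])).sum(axis=1)
    joint = 1.0 / denom          # P(I = w, J = w)
    return joint, joint / q      # and P(I = w | J = w)

joint, cond = paired_token_match_probs(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.3, 0.2]))
print(joint.sum(), cond)         # joint.sum() is P(I = J)
```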

shermansiu commented 9 months ago

Also, from [1], we can use stochastic beam search (i.e. use the Gumbel-max trick to sample from the distribution k times without replacement) if needed.

[1] https://arxiv.org/abs/1903.06059

learning-chip commented 9 months ago

Thanks for the insightful discussions. I tried the new lookahead + sampling function in this PR with a very low temperature (like 0.01) to see if it converges to the greedy result. However, it doesn't give the expected output, unlike the original HF model. Maybe it's due to numerical precision issues, or just an implementation bug?

To reproduce:

```
git clone https://github.com/knagrecha/LookaheadDecoding-sampling
git checkout abdae068f096bd608f6fbcd53a8c7c93b1125d05  # current latest commit
# run the modified minimal.py script below
python minimal.py
LOAD_LADE=1 USE_LADE=1 python minimal.py
```

The script modified from minimal.py:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import time
import os

if int(os.environ.get("LOAD_LADE", 0)):
    import lade
    lade.augment_all()
    lade.config_lade(LEVEL=7, WINDOW_SIZE=20, GUESS_SET_SIZE=20, DEBUG=1)

assert torch.cuda.is_available()
torch_device = "cuda"
torch_dtype = torch.float16

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v0.6"
temperature = 0.01

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch_dtype, device_map=torch_device)
model.tokenizer = tokenizer

prompt = "How do you fine tune a large language model?"
input_text = (
    f"<|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n"
)
model_inputs = tokenizer(input_text, return_tensors='pt').to(torch_device)

# warm up
greedy_output = model.generate(**model_inputs, max_new_tokens=1, do_sample=False)
# end warm up

# generate 256 new tokens
torch.cuda.synchronize()
t0 = time.time()
greedy_output = model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
t1 = time.time()

torch.cuda.synchronize()
t2 = time.time()
sampled_output = model.generate(**model_inputs, max_new_tokens=256, do_sample=True, temperature=temperature)
torch.cuda.synchronize()
t3 = time.time()

print("Greedy Output:\n" + 100 * '-')
print(tokenizer.decode(greedy_output[0], skip_special_tokens=False))
print("Generated Tokens:", (greedy_output.numel() - model_inputs['input_ids'].numel()), "Generation Speed: ", (greedy_output.numel() - model_inputs['input_ids'].numel()) / (t1 - t0), " tokens/s")
print()
print(f"Sampled Output [temperature = {temperature}]:\n" + 100 * '-')
print(tokenizer.decode(sampled_output[0], skip_special_tokens=False))
print("Generated Tokens:", (sampled_output.numel() - model_inputs['input_ids'].numel()), "Generation Speed: ", (sampled_output.numel() - model_inputs['input_ids'].numel()) / (t3 - t2), " tokens/s")
```

The HuggingFace baseline output with temperature = 0.01 gives exactly the same result as greedy decoding:

``` python minimal.py Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Greedy Output: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, GPT-2, and XLM-R. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. You can use tools like TensorBoard or Sklearn's metrics to monitor the model Generated Tokens: 256 Generation Speed: 46.21632607647198 tokens/s Sampled Output [temperature = 0.01]: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, GPT-2, and XLM-R. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. You can use tools like TensorBoard or Sklearn's metrics to monitor the model Generated Tokens: 256 Generation Speed: 46.0119918769736 tokens/s ```

The lookahead output gives quite different results every time:

``` LOAD_LADE=1 USE_LADE=1 python minimal.py Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. ==========================ACCELERATION===SUMMARY====================================== Generated tokens: 1 Total steps: 1 Compression ratio: 1.0 ====================================================================================== ==========================ACCELERATION===SUMMARY====================================== Generated tokens: 256 Total steps: 142 Compression ratio: 1.8 ====================================================================================== ==========================ACCELERATION===SUMMARY====================================== Generated tokens: 256 Total steps: 146 Compression ratio: 1.75 ======================================================================================Greedy Output: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, GPT-2, and XLM-R. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. You can use tools like TensorBoard or Sklearn's metrics to monitor the model Generated Tokens: 256 Generation Speed: 71.48308842047702 tokens/s Sampled Output [temperature = 0.01]: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant There are several ways to fine tune a large language model: 1. Data augmentation: incorporating real-world data to improve the model's training data 2. Hyperparameter tuning: adjusting the model's hyperparameters, such as learning rate, batch size, and number of epochs, to improve performance 3. Model compression: reducing the number of parameters while still reducing the model's capacity to learn new concepts. 4. Model fine-tuning: fine-tuning the model on the target language to tailor it to the specific target. 5. Model evaluation: using different evaluation metrics, such as accuracy, flapping exceptions, or word error rates, to assess the model's performance. In general, to fine tune a large language model, you will want to analyze the data used in the training process and adjust your hyperparameters accordingly. Some common factors to consider when fine-tuning include: 1. Number of epochs: the number of training epochs to run over the dataset 2. Learning rate: the regularization rate used in the training process 3. 
Batch size: the size of the batch used in the training process 4. Model compression: Generated Tokens: 256 Generation Speed: 70.38865874872627 tokens/s ```

I suspected it's a numerical issue because the `(logits + gumbels) / tau` calculation in `gumbel(logits, tau)` (`tau` = temperature) grows to a very large magnitude when the temperature is close to zero. However, switching to float32 or float64 also didn't help.

I suggest adding a unit test to verify that the added sampling function converges to the greedy output as temperature -> 0.
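
Something roughly like the following could work as that test (a sketch that inlines a standard Gumbel-max sampler rather than calling this PR's helper, and uses synthetic logits with a clear winner per row so the check is not borderline):

```python
import torch

def test_low_temperature_converges_to_greedy():
    torch.manual_seed(0)
    vocab, batch = 32_000, 64
    logits = torch.randn(batch, vocab)
    # Give each row an unambiguous top token so a tiny temperature must reproduce greedy.
    winners = torch.randint(0, vocab, (batch,))
    logits[torch.arange(batch), winners] += 10.0
    tau = 0.01
    gumbels = -torch.empty_like(logits).exponential_().log()   # Gumbel(0, 1)
    sampled = (logits / tau + gumbels).argmax(dim=-1)
    assert torch.equal(sampled, logits.argmax(dim=-1))
```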

knagrecha commented 9 months ago

Good catch on the numerical issue. The smaller the denominator, the closer the value is forced towards INF. This might actually be a point in favor of reverting to PyTorch's Gumbel softmax (which would handle this) rather than using Gumbel directly.

knagrecha commented 9 months ago

Good to see the speedup on sampling confirmed by someone else though (70 vs. 46 tokens/s).

learning-chip commented 9 months ago

Good to see the speedup on sampling confirmed by someone else though (70 vs. 46 tokens/s).

Yes, I tested on RTX 4090 with torch 2.0.1

shermansiu commented 9 months ago

It's not a numerical issue. I just tested it. I added a comment to fix the issue (in Yard1's code review, above).

shermansiu commented 9 months ago

The issue is that `logits / tau + gumbels` should be returned, not `(logits + gumbels) / tau`.

```python
def gumbel(logits, tau):
    gumbels = (
        -torch.empty_like(logits, memory_format=torch.legacy_contiguous_format).exponential_().log()
    )  # ~Gumbel(0,1)
    return (logits + gumbels) / tau  # ~Gumbel(logits,tau)
```

Using `(logits + gumbels) / tau` has the effect of sampling with temperature 1, regardless of the value of `tau` passed.
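
A corrected version along those lines (a sketch mirroring the quoted function):

```python
def gumbel(logits, tau):
    gumbels = (
        -torch.empty_like(logits, memory_format=torch.legacy_contiguous_format).exponential_().log()
    )  # ~Gumbel(0,1)
    # Scale only the logits by the temperature; adding unscaled Gumbel(0,1) noise and
    # taking the argmax then samples from softmax(logits / tau).
    return logits / tau + gumbels
```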

shermansiu commented 9 months ago

minimal.py output for TinyLlama:

Without LADE (RTX 3090) ``` Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Greedy Output: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, GPT-2, and XLM-R. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. You can use tools like TensorBoard or Sklearn's metrics to monitor the model Generated Tokens: 256 Generation Speed: 53.458491996524465 tokens/s Sampled Output [temperature = 0.01]: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, XLM, and RoBERTa. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. You can use tools like TensorBoard or Sklearn's metrics to monitor the model's Generated Tokens: 256 Generation Speed: 58.388901266131164 tokens/s ```
With LADE: ```Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. ==========================ACCELERATION===SUMMARY====================================== Generated tokens: 1 Total steps: 1 Compression ratio: 1.0 ====================================================================================== ==========================ACCELERATION===SUMMARY====================================== Generated tokens: 256 Total steps: 138 Compression ratio: 1.86 ====================================================================================== ==========================ACCELERATION===SUMMARY====================================== Generated tokens: 256 Total steps: 138 Compression ratio: 1.86 ======================================================================================Greedy Output: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, XLM, and RoBERTa. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. You can use tools like TensorBoard or Sklearn's metrics to monitor the model's Generated Tokens: 256 Generation Speed: 83.03608243954638 tokens/s Sampled Output [temperature = 0.01]: ---------------------------------------------------------------------------------------------------- <|im_start|>user How do you fine tune a large language model?<|im_end|> <|im_start|>assistant To fine tune a large language model, you need to make several adjustments to the parameters of the model. Here are the steps to follow: 1. Choose the right dataset: The first step is to choose the right dataset for your language model. You can use a large dataset like the IMDB movie review dataset or the Google News dataset. 2. Choose the right model: The second step is to choose the right model for your language model. There are several models available in the NLP community, such as BERT, XLM, and RoBERTa. 3. Adjust the learning rate: The learning rate is the rate at which the model updates its parameters. You can adjust the learning rate by tuning the hyperparameters of the model. 4. Adjust the regularization: The regularization is the process of reducing the model's complexity by adding constraints or regularization terms. You can adjust the regularization by tuning the hyperparameters of the model. 5. Monitor the performance: Monitor the performance of the model by using metrics like accuracy, precision, and recall. 
You can use tools like TensorBoard or Sklearn's metrics to monitor the model's Generated Tokens: 256 Generation Speed: 78.39760457443923 tokens/s ```
shermansiu commented 9 months ago

Also, here are some random information theoretic experiments that are somewhat related to the above proofs.

Why do I keep doing this? ```python3 import numpy as np from scipy.special import kl_div from sklearn.metrics.pairwise import cosine_similarity import matplotlib.pyplot as plt RANGE = 200 def generate_random_logits(size): """Generate random logits of given size.""" return np.random.uniform(-RANGE, RANGE, size) def softmax(logits): """Compute the softmax probabilities from logits.""" exp_logits = np.exp(logits - np.max(logits)) # Subtracting max(logits) for numerical stability return exp_logits / np.sum(exp_logits) def generate_random_prob_vector(size): """Generate a random probability vector of given size.""" # rand_vector = np.random.rand(size) # rand_prob_vector = rand_vector / np.sum(rand_vector) # return rand_prob_vector return softmax(generate_random_logits(size)) def kl_divergence(p, q): """Compute the Kullback-Leibler (KL) divergence between two probability vectors.""" return np.sum(kl_div(p, q)) def compute_l1_norm(vector): """Compute the L1 norm of a vector.""" return np.sum(np.abs(vector)) def plot_dot_product_vs_kl_divergence(prob_vectors): """Plot the dot product vs. KL divergence for each pair of probability vectors.""" num_vectors = len(prob_vectors) dot_products = np.zeros((num_vectors, num_vectors)) kl_divergences = np.zeros((num_vectors, num_vectors)) for i in range(num_vectors): for j in range(num_vectors): dot_products[i, j] = np.dot(prob_vectors[i], prob_vectors[j]) kl_divergences[i, j] = kl_divergence(prob_vectors[i], prob_vectors[j]) plt.figure(figsize=(10, 6)) plt.scatter(dot_products, kl_divergences, s=100) plt.title('Dot Product vs. KL Divergence') plt.xlabel('Dot Product') plt.ylabel('KL Divergence') plt.grid(True) plt.show() def plot_cosine_similarity_vs_kl_divergence(prob_vectors): """Plot the cosine similarity vs. KL divergence for each pair of probability vectors.""" num_vectors = len(prob_vectors) cosine_similarities = np.zeros((num_vectors, num_vectors)) kl_divergences = np.zeros((num_vectors, num_vectors)) for i in range(num_vectors): for j in range(num_vectors): cosine_similarities[i, j] = cosine_similarity([prob_vectors[i]], [prob_vectors[j]])[0, 0] kl_divergences[i, j] = kl_divergence(prob_vectors[i], prob_vectors[j]) plt.figure(figsize=(10, 6)) plt.scatter(cosine_similarities, kl_divergences, s=100) plt.title('Cosine Similarity vs. KL Divergence') plt.xlabel('Cosine Similarity') plt.ylabel('KL Divergence') plt.grid(True) plt.show() def plot_dot_product_vs_l1_distance(prob_vectors): num_vectors = len(prob_vectors) dot_products = np.zeros((num_vectors, num_vectors)) l1_distances = np.zeros((num_vectors, num_vectors)) for i in range(num_vectors): for j in range(num_vectors): dot_products[i, j] = np.dot(prob_vectors[i], prob_vectors[j]) l1_distances[i, j] = compute_l1_norm(prob_vectors[i] - prob_vectors[j]) # Plotting plt.figure(figsize=(12, 6)) plt.subplot(1, 2, 1) plt.scatter(dot_products, l1_distances, s=100) plt.title('Dot Product vs. 
L1 Distance') plt.xlabel('Dot Product') plt.ylabel('L1 Distance') plt.grid(True) plt.subplot(1, 2, 2) plt.hist(l1_distances[np.triu_indices(num_vectors, k=1)], bins=20, color='green', alpha=0.7) plt.title('Histogram of L1 Distance') plt.xlabel('L1 Distance') plt.ylabel('Frequency') plt.grid(True) plt.tight_layout() plt.show() def plot_kl_divergence_histogram(prob_vectors): """Plot the histogram of KL divergences for each pair of probability vectors.""" num_vectors = len(prob_vectors) kl_divergences = np.zeros((num_vectors, num_vectors)) for i in range(num_vectors): for j in range(num_vectors): kl_divergences[i, j] = kl_divergence(prob_vectors[i], prob_vectors[j]) # Flatten the upper triangular part of the matrix (excluding the diagonal) flat_kl_divergences = kl_divergences[np.triu_indices(num_vectors, k=1)] plt.figure(figsize=(10, 6)) plt.hist(flat_kl_divergences, bins=20, color='blue', alpha=0.7) plt.title('Histogram of KL Divergences') plt.xlabel('KL Divergence') plt.ylabel('Frequency') plt.grid(True) plt.show() def main(): np.random.seed(42) # Set a seed for reproducibility k = 100 # Number of random probability vectors prob_vectors = [generate_random_prob_vector(100) for _ in range(k)] plot_dot_product_vs_kl_divergence(prob_vectors) plot_cosine_similarity_vs_kl_divergence(prob_vectors) plot_dot_product_vs_l1_distance(prob_vectors) plot_kl_divergence_histogram(prob_vectors) ``` ![dot product vs KL divergence of random softmax vectors](https://github.com/hao-ai-lab/LookaheadDecoding/assets/12627125/cace206c-fcbf-4eee-babc-cc79714b4116) ![cosine similarity vs KL divergence of random softmax vectors](https://github.com/hao-ai-lab/LookaheadDecoding/assets/12627125/99eeb38a-856c-4ffb-9370-639b3c4020d4) ![dot product vs L1 norms, histogram of L1 norms of random softmax vectors](https://github.com/hao-ai-lab/LookaheadDecoding/assets/12627125/0896ac0d-8ed2-40f8-b85a-24f49b1a9860) ![histogram of KL divergences](https://github.com/hao-ai-lab/LookaheadDecoding/assets/12627125/7aaf2b77-8961-4d13-ae16-e5bd978fa353) I know I could have left the probability vector self-comparisons out, but I wanted to leave the edge cases in so that they would be plotted.

Dot product vs. total variation distance

Empirically, it seems that if $1-\epsilon\le\mathbf p\cdot \mathbf q\le 1$, then $\frac12 \lVert \mathbf p -\mathbf q\rVert_1 \le \epsilon$ (but not the other way around, the reverse bound seems a lot looser).

In the case where $\mathbf p_\omega = 1$, we have equality, i.e. $\frac12 \lVert \mathbf p -\mathbf q\rVert_1=\frac12 (p_\omega-q_\omega + \sum_{k\neq \omega}(q_k-p_k)) = \frac12 (1-q_\omega + \sum_{k\neq \omega}(q_k-0))= \frac12 (1-q_\omega + 1-q_\omega)=1-q_\omega=1-\mathbf p \cdot\mathbf q$.

The general proof is actually quite similar. Let $A\subseteq [K]$ be the subset of indices where $p_k\ge q_k$. Then

$$ \begin{align} \frac12 \lVert \mathbf p -\mathbf q\rVert_1&=\frac12 \left(\sum_{k\in A}(p_k-q_k)+\sum_{k\not\in A}(q_k-p_k)\right)\\ &=\sum_{k\in A}(p_k-q_k)\\ &\le\sum_{k\in A}p_k(1-q_k)\\ &\le\sum_{k\in [K]}p_k(1-q_k)\\ &=1-\mathbf p\cdot \mathbf q \end{align} $$

(The second line uses $\sum_k(p_k-q_k)=0$, so the two halves of the L1 norm are equal; the third uses $q_k\ge p_kq_k$.)

So if $1-\epsilon\le\mathbf p\cdot \mathbf q$ or $1-\mathbf p\cdot \mathbf q\le \epsilon$, then $\frac12 \lVert \mathbf p -\mathbf q\rVert_1 \le \epsilon$.

Dot product vs KL Divergence

Similarly, suppose that the maximum difference between logits, or equivalently, the maximum difference between log probabilities is $M$. Then $\frac1M D_{KL}(P||Q)\le 1-\mathbf p\cdot \mathbf q$. For $k\in A$, $p_k\ge q_k$, so $\ln\frac{p_k}{q_k}\ge0$. (If $k\not\in A$, then it is negative).

(This proof is harder...)

Edit: Not true for some values of $\mathbf p$ and $\mathbf q$.

dot product vs. KL divergence 3D continuous plot

Counter-example

```python3
import numpy as np
from scipy.special import kl_div

p = np.array([0.95, 0.05])
q = np.array([0.5, 0.5])

# Calculate 1 - p.q
one_minus_dot_product = 1 - np.dot(p, q)

# Calculate normalized KL-Divergence
normalized_kl_divergence = kl_div(p, q).sum() / (np.log(p) - np.log(q)).max()

print(f"1 - p.q: {one_minus_dot_product:.4f}")
print(f"Normalized KL Divergence: {normalized_kl_divergence:.4f}")
```

```
1 - p.q: 0.5000
Normalized KL Divergence: 0.7706
```
shermansiu commented 9 months ago

So, with iid Gumbel noise vectors for both distributions, you get

$P(I=J) = \mathbf p\cdot \mathbf q$

The following are probabilities for a single Jacobi decoding step. Obviously, due to the autoregressive nature of LLMs, subsequent probabilities depend on which tokens are accepted, which causes the math to not simplify to a nice closed form when considering more than 1 step.

For the sake of completeness, I'll include the following n-gram statistics; they follow from basic probability theory:

  1. Getting an n-gram of length at least $n$ (i.e. discard the run at the first failure, count the number of successes from the first Bernoulli trial):

$\prod\limits_{i=1}^n \mathbf p^{(i)}\cdot \mathbf q^{(i)}$

  2. Getting an n-gram of length exactly $n$:

$(1-\mathbf p^{(n+1)}\cdot \mathbf q^{(n+1)})\prod\limits_{i=1}^{n} \mathbf p^{(i)}\cdot \mathbf q^{(i)}$

  3. Expected n-gram length (we only go up to $N$ trials):

$\sum\limits_{n=1}^{N-1} [n(1-\mathbf p^{(n+1)}\cdot \mathbf q^{(n+1)})\prod\limits_{i=1}^{n} \mathbf p^{(i)}\cdot \mathbf q^{(i)}]+N\prod\limits_{i=1}^N \mathbf p^{(i)}\cdot \mathbf q^{(i)}$
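
A small sketch of these three statistics in code (my own illustration; `accept_probs[i]` stands in for the per-step product $\mathbf p^{(i+1)}\cdot \mathbf q^{(i+1)}$ with 0-based indexing):

```python
import numpy as np

def ngram_stats(accept_probs):
    a = np.asarray(accept_probs)
    N = len(a)
    at_least = np.cumprod(a)                  # P(length >= n) for n = 1..N
    exactly = at_least[:-1] * (1 - a[1:])     # P(length == n) for n = 1..N-1
    expected = np.sum(np.arange(1, N) * exactly) + N * at_least[-1]
    return at_least, exactly, expected

print(ngram_stats([0.9, 0.8, 0.7, 0.6]))
```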


For the same $K$ iid Gumbel noise vectors for both distributions (e.g. because we are evaluating what would have happened if we had chosen to sample from $\mathbf p$ vs. $\mathbf q$ when using a pseudo-random number generator (PRNG)):

$P(I=J) = \sum_k \frac1{\sum_j \exp(\max(\theta_j-\theta_k, \phi_j-\phi_k))}= \sum_k \frac1{\sum_j \exp(\max(\ln p_j-\ln p_k, \ln q_j-\ln q_k))}$

Since $\exp$ is monotonic, $\sum_k \frac1{\sum_j \exp(\max(\ln p_j-\ln p_k, \ln q_j-\ln q_k))}=\sum_k \frac1{\sum_j \max( p_j/p_k, q_j/q_k)}$

The equivalent three n-gram statistics for this case are a trivial extension of the case where you have iid Gumbel noise vectors and are left as an exercise for the reader. 🤗

shermansiu commented 9 months ago

Now, suppose we want to validate whether an $N$-gram is a valid completion, and that we have a sequence of $N$ Gumbel noise vectors, each of dimension $K$. Then we'll accept a completion sampled from an approximate distribution $Q$ if it is the same completion we would get when sampling (using Gumbel sampling) from the true distribution $P$. Thus, we've reduced the problem of checking if the distributions are similar enough to checking if a sample from the approximate distribution is the same as a sample from the true distribution (I thought this is what we were doing all along, but I realized that the code doesn't actually implement this).

This means that we'll need to store the Gumbel values in-memory until we've verified that the token at a particular time step is accurate. Moreover, instead of using different Gumbel noise vectors each time (as the current implementation does), we'll need to use the same Gumbel noise vectors at each time step.

Obviously, this means that we might unnecessarily discard valid generations if the entropy of the next token distribution is high (i.e. close to being uniformly random). But this way, we have stronger theoretical guarantees for verification.
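
In code, the verification rule being described might look roughly like this (a sketch with made-up names, not the repo's API; the logits are assumed to already be divided by the temperature):

```python
import torch

def verify_ngram_with_shared_gumbel(true_logits, draft_tokens, gumbel_noise):
    """Accept the longest prefix of `draft_tokens` that the true model would also have
    produced when sampled with the *same* stored Gumbel noise at each position.

    true_logits:  [n, vocab] next-token logits from the verification pass
    draft_tokens: [n]        tokens proposed by the lookahead branch
    gumbel_noise: [n, vocab] the Gumbel(0, 1) noise used when the draft was sampled
    """
    true_choice = (true_logits + gumbel_noise).argmax(dim=-1)
    matches = (true_choice == draft_tokens).long()
    # Accept tokens up to (but not including) the first mismatch.
    n_accept = int(matches.cumprod(dim=0).sum())
    return draft_tokens[:n_accept], n_accept
```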

@knagrecha This idea isn't dead, right? I noticed that you haven't incorporated Yard1's suggestions into the code from a month ago.

knagrecha commented 9 months ago

Not dead, just my lack of skill with GitHub 😅. I hit the “commit” button on the comment and thought it would integrate.

knagrecha commented 9 months ago

That should do it for Yard1’s suggestion. I should stop handling my GitHub threads & commits on Chrome mobile…leads to accidental uncommitted changes like this.

shermansiu commented 9 months ago

Happy new year! 🎆

Also, $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ is a weird function. 1. What does it look like? And 2. how much better is it to use the same Gumbel noise vectors for each time step?

Remark: the paired next token equality probability $\sum_k\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ kind of looks like a modified version of the harmonic mean... I also need a better name for the event of "next token equality", because that is a mouthful to say.

What does it look like?

For now, let's examine the probability for the $k$'th token only.

We previously (https://github.com/hao-ai-lab/LookaheadDecoding/pull/6#issuecomment-1866930575) established that $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)} \le p_k$ and $\le q_k$ (except when $p_k$ or $q_k$ = 0, but we'll just define it to be 0 for now), so it is $\le \min(p_k,q_k)$. Thus, it is also $\le 1$ by transitivity. Because probability vectors are non-negative, $0\le \frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$, but can we give a better lower bound?

Minimizing $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ is the same thing as maximizing its denominator, which is equivalent to $1 + \sum_{j\neq k} \max(p_j/p_k,q_j/q_k)$. Because the L1 norms of probability vectors are equal to 1, to maximize the contribution of each sum, the supports of $p$ and $q$ over $[K]\setminus\{k\}$ must be disjoint.

In other words, $1 + \sum_{j\neq k} \max(p_j/p_k,q_j/q_k) \le 1 + \frac1{p_k}(1-p_k)+\frac1{q_k}(1-q_k)= \frac1{p_k}+\frac1{q_k} - 1$ (with equality when the supports of $p$ and $q$ over $[K]\setminus\{k\}$ are disjoint).

So, $\frac1{\frac1{p_k}+\frac1{q_k} - 1}\le \frac1{\sum_j \max(p_j/p_k,q_j/q_k)} \le \min(p_k,q_k)$ for when $p_k$ and $q_k$ are in $(0,1]$.
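
A quick numeric spot-check of this sandwich bound (again just a sketch with random distributions):

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(20))
q = rng.dirichlet(np.ones(20))
for k in range(len(p)):
    paired_k = 1.0 / np.maximum(p / p[k], q / q[k]).sum()
    lower = 1.0 / (1.0 / p[k] + 1.0 / q[k] - 1.0)
    assert lower - 1e-12 <= paired_k <= min(p[k], q[k]) + 1e-12
```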

plot of $\frac1{\frac1{p_k}+\frac1{q_k} - 1}$ between 0 and 1 on all axes

plot of $\min(p_k,q_k)$ between 0 and 1 on all axes

Thus, $\frac1{\sum_j \max(p_j/p_k,q_j/q_k)}$ is a smooth approximator of $\min(p_k,q_k)$ with equality at the boundary conditions where at least one of $p_k$ or $q_k$ lie in $\{0,1\}$ (For the lower bound, if either $p_k$ or $q_k$ is equal to 0, we take the limit, but if they are both 0, then we set it to 0 for convenience (as the limit DNE). We then use the Squeeze Theorem.).

Full graph ![full graph comparing $\frac1{\frac1{p_k}+\frac1{q_k} - 1}$ and $\min(p_k,q_k)$](https://github.com/hao-ai-lab/LookaheadDecoding/assets/12627125/2e6bd704-3f1b-4ec5-9632-438105940ca6)

As it turns out, the equality probability under unpaired Gumbel sampling is also a smooth approximator of $\min(p_k,q_k)$ with equality at the boundary square formed when $|\{p_k,q_k\}\cap\{0,1\}|>0$.

plot of $p_kq_k$ between 0 and 1 for all axes

So how much better is paired sampling than unpaired sampling, really?

Advantage of using paired vs unpaired Gumbel sampling

So, how much better is paired sampling than unpaired sampling (which has probability $p_k q_k$)? We can take the ratio of probabilities to find out.

Upper bound ratio

Since $\min(p_k,q_k)\max(p_k,q_k)=p_kq_k$, $\frac{p_kq_k}{\min(p_k,q_k)}=\max(p_k,q_k)$.

Lower bound ratio

$p_kq_k (\frac1{p_k}+\frac1{q_k} - 1)=p_k+q_k-p_kq_k$.

probability ratios of paired vs unpaired Gumbel sampling

So, the ratio of probabilities between using unpaired vs paired Gumbel sampling is roughly equal to $\max(p_k,q_k)$.


But what about the ratio when $p_k=q_k$? Then the ratio curve isn't linear for the lower bound ratio, as it is for the upper bound ratio, but quadratic instead.

curve along the middle

So it seems like even for unpaired sampling, the approximation tends to be surprisingly good, even if not perfect.

Conclusion

So in summary, there are 2 advantages for using the same Gumbel noise vectors at each time step:

  1. We guarantee that the completion could have been sampled from the true distribution (which we showed here: https://github.com/hao-ai-lab/LookaheadDecoding/pull/6#issuecomment-1869882564)
  2. There is a larger probability of getting the same token at a particular time step.

Next steps: Show that there is an empirical speed boost when using the same Gumbel noise vectors at each time step.

(Also, do human pairwise preferences between approximate and true completions indicate that true completions are better than approximate ones, even at higher temperatures?)

shermansiu commented 8 months ago

PR Changes

I made the following changes to this PR:

Viol2000 commented 4 months ago

Sampling is now supported via rejection sampling, as in the previous work SpecInfer. Gumbel can be a good method for in-place sampling. Closing this thread now; we may reopen another issue for in-place Gumbel support.