huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Min P style sampling - an alternative to Top P/TopK #27670

Closed. kalomaze closed this issue 6 months ago.

kalomaze commented 1 year ago

Feature request

Min P is a sampler method, already present in other LLM inference backends, that aims to simplify the truncation process and compensate for the flaws/failings of Top P and Top K.

[image]

What Min P does is simple: it sets a minimum probability that a token must reach to be considered during sampling. However, this is not a hard limit; the minimum scales with the top token's probability. So, with a Min P value of 0.1 (for example), the base requirement is 10% of the top token's probability: if your top token has a 25% probability, only tokens with at least 2.5% probability are considered.
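To make the rule concrete, here is a minimal PyTorch sketch (illustrative only; the helper name `min_p_warp` is a placeholder, not any backend's implementation):

```python
import torch

def min_p_warp(logits: torch.Tensor, min_p: float = 0.05) -> torch.Tensor:
    """Keep tokens whose probability is at least min_p * p(top token), then renormalize."""
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values  # cutoff scaled by the top token
    kept = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    return kept / kept.sum(dim=-1, keepdim=True)  # renormalized distribution to sample from
```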

This method subjectively seems to improve results across the board with no noticeable downside, and has been merged into the following FOSS LLM backends:

I would suggest a default of 0.05.

Motivation

I noticed certain 'flaws' in the popular Top P sampling method:

[image]

For this reason I made Min P, which seems to have had a positive reception across the board.

Your contribution

I may consider making a PR for this.

ArthurZucker commented 1 year ago

fyi @gante πŸ€—

gante commented 1 year ago

Hi @kalomaze πŸ‘‹ Thank you for opening this issue!

In addition to Temperature, Top p, and Top k, which apply distribution-agnostic transformations, we have three other distribution-aware transformations:

  1. Typical P Decoding
  2. Epsilon Sampling
  3. Eta Sampling

These techniques do a similar thing to what you mention: they apply a "Top p"-like transformation, adjusted by the probability distribution.

Since we already have similar techniques, backed up by papers with benchmarks, I'm reluctant to add this technique without further benchmarks. Maintenance is a heavy long-term burden in transformers that we want to contain πŸ€—
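(For reference, a sketch of how these existing distribution-aware warpers can be enabled through `generate()` kwargs; the model name and cutoff values below are placeholders, not recommended settings:)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("The quick brown fox", return_tensors="pt")

# Each kwarg below activates one of the distribution-aware warpers.
out = model.generate(**inputs, do_sample=True, max_new_tokens=20, typical_p=0.95)       # Typical P decoding
out = model.generate(**inputs, do_sample=True, max_new_tokens=20, epsilon_cutoff=3e-4)  # Epsilon sampling
out = model.generate(**inputs, do_sample=True, max_new_tokens=20, eta_cutoff=3e-4)      # Eta sampling
print(tok.decode(out[0], skip_special_tokens=True))
```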

kalomaze commented 1 year ago


The scalability of Min P in comparison to Top P seems objectively more consistent, and not just in theorycrafting.

Min P is also highly interpretable compared to Locally Typical sampling, which gets into denser, more subjective interpretations of information theory, raising the question of whether it's overdesigned. This makes Typical sampling less intuitive for the end user.

In addition, Typical sampling, Epsilon sampling, and Eta sampling have seen extremely limited real-world adoption in open-source LLM interfaces, which by and large have continued to use Top K and Top P. Where those two aren't used, Mirostat has seen mild popularity, but I would argue the latter two samplers (Epsilon sampling, Eta sampling) are perhaps less proven in terms of subjective quality.

In conclusion, Min P:

[images]

I will also note that a common issue for open-source language models is the lack of truly objective metrics for testing beyond manual human analysis; so any apparently 'standard' testing metrics should be given serious scrutiny before they are considered absolute and final measures by which to compare sampler methods.

If there are any specific metrics you would like to see on any specific models, I can try to provide them to support my case beyond the subjective results and widespread adoption of the technique. (I figured those would stand out on their own, but having numbers would be beneficial... assuming we can trust the numbers, which is an assumption I'm hesitant to make without sufficient fundamental evidence for their use beyond "arxiv papers used it.")

gante commented 1 year ago

@kalomaze precisely because in the past we've added techniques that had some results but ended up not seeing much use (like Eta sampling), I'm asking for additional validation :) For instance, Eta sampling had a blind human preference test, where it was shown to be preferred over top p, with a relatively low sample size (N=294). However, the upside (and the marketing) was not large enough, so the community decided to stick with simpler, established techniques like top p.

Just because other repos have merged your technique does not make it inherently good. ML is a data-driven science, so let's collect data -- I have yet to see any data beyond a few examples. Note that this is nothing against your creation; I actually agree with it in principle. transformers is a large library with a few maintainers, so we have to be conscious of what we add here.

A good test would be to compare your technique against others with blind human preference πŸ€— There is nothing better than human preference -- I'd be happy to participate in the evaluation.

kalomaze commented 1 year ago

> A good test would be to compare your technique against others with blind human preference πŸ€— There is nothing better than human preference -- I'd be happy to participate in the evaluation.

Do we have enough people who are willing to test / evaluate this to rule out the margin of error, though? The main thing we are looking for is to minimize the outliers that get included when improving the truncation schemes (and those are usually low-probability to begin with). Outliers are going to be hard to test for without sufficient data if you sample normally, unless we change the sampler to only pick the least likely surviving token (as a way to measure truncation consistency directly).

I've done exactly that before for Top P and Min P, and I saw that Min P was an obvious improvement. Would you like me to reproduce that experiment with Typical sampling? (Llama.cpp, my inference engine of choice, has a broken implementation of Typical sampling at the moment, but there is a PR to fix it that I can use. Eta/Epsilon just aren't adopted anywhere else in the LLM world, so I'd have to learn how to use Transformers to test those, which seems like it will be necessary for my future LLM tests anyway.)

I'm also aware that an appeal to popularity isn't hard evidence, but I think it's a stronger marker in this case than it would otherwise be, given the context of LLM benchmarks and especially certain metrics (e.g. perplexity) being dubious, unreliable markers of quality in the ML space.
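To make the "least likely surviving token" probe concrete, here is a rough sketch (the helper is hypothetical and not code from llama.cpp or any other backend):

```python
import torch

def worst_surviving_token(logits: torch.Tensor, min_p: float = 0.1) -> int:
    """Apply Min P truncation, then return the *least* likely token that survived,
    so weak tail tokens surface immediately instead of only appearing occasionally."""
    probs = torch.softmax(logits, dim=-1)
    kept = probs >= min_p * probs.max()              # Min P truncation mask
    masked = probs.masked_fill(~kept, float("inf"))  # exclude removed tokens from the search
    return int(masked.argmin())                      # worst token that is still allowed
```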

gante commented 1 year ago

> Do we have enough people who are willing to test / evaluate this to rule out the margin of error, though?

Between your reddit and my twitter/LI reaches, we will definitely have more than enough people to run a proper study. If you agree to build the interface for the study (e.g. through a HF spaces), I'd be more than happy to promote it! I also have the power to allocate GPUs to a space in order to run the study πŸ’ͺ

> The main thing we are looking for is to minimize the outliers that get included when improving the truncation schemes (and those are usually low-probability to begin with). Outliers are going to be hard to test for without sufficient data if you sample normally, unless we change the sampler to only pick the least likely surviving token (as a way to measure truncation consistency directly).

I agree that the biggest difference is in the outliers. However, each output may have tens or hundreds of tokens, so the effect of bad "1% probability tokens" is not that hard to observe :) If there is noticeable human preference after >1000 samples, then we can be sure that it makes a difference.

Also, if the test turns out to be a success, you'd gain much more power over the distribution of your technique :D There are no questions over human preference.

> especially certain metrics (e.g. perplexity) being dubious, unreliable markers of quality in the ML space

100% agreed

kalomaze commented 1 year ago


Understood; I've never made a HF Space, so that'd be new territory for me, though I'll look into it for sure (since having empirical data would be helpful).

What would be a fair comparison value to Top P? Or would you prefer something where all methods are evaluated (that might be too aggressive, though)? The next problem, I think, is finding an 'equivalent scale' for all methods. The scale of Min P is obvious and well understood, but for Epsilon etc. it's difficult for me to determine...

gante commented 1 year ago

@kalomaze I'd suggest starting simple, going up against top p alone. Less work and straight to the point. If we realize we're gathering enough participants, then we can expand it to multiple models and multiple strategies, for a better overview.

I can help you with any roadblocks or questions you have along the way: the results are very much of interest to me! πŸ’›

(and I'm crossing my fingers for Min P to be successful!)

kalomaze commented 12 months ago


I see, that's very doable.

How about:

At temperature 1.0?

gante commented 12 months ago

@kalomaze sounds good (I'm assuming you have a better sense than me of what a good pairing looks like :) )

I'd perhaps suggest lowering the temperature a bit, to 0.7-0.8 (which is what most LLMs use by default nowadays)

kalomaze commented 12 months ago

> I'd perhaps suggest lowering the temperature a bit, to 0.7-0.8 (which is what most LLMs use by default nowadays)

The OpenAI API docs suggest either lowering temperature or using Top P, but not both, which seems to imply truncation sampling was intended for use with a standard temperature (which makes sense to me); the default for GPT is also 1.0 in the first place. Temperature 1.0 is also representative of the original logit scores transformed into probabilities, rather than an arbitrary transformation, so it makes the most sense to me, at least, to compare at this value (unless you have other reasons for it).

gante commented 12 months ago

@kalomaze temperature can be seen as a post-hoc calibration of the model logits -- an underconfident model should use a temperature below 1.0 and vice-versa. You can also see it as sharpening (<1.0) or flattening (>1.0) the probability distribution. It does have some overlap with top p, with the difference that top p acts on the probabilities and temperature on the log probabilities -- after top p, you can end up with the same set of possible tokens, but the temperature will still affect their relative distribution.
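To illustrate the sharpening/flattening effect numerically (the logits below are arbitrary example values):

```python
import torch

# Dividing logits by the temperature before softmax sharpens the
# distribution when T < 1.0 and flattens it when T > 1.0.
logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
for T in (0.7, 1.0, 1.5):
    probs = torch.softmax(logits / T, dim=-1)
    print(f"T={T}: {[round(p, 3) for p in probs.tolist()]}")
```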

The optimal temperature changes across models and tasks, with llama models excelling around ~0.7 for most tasks. For instance, the starcoder model is recommended to be used with temperatures around ~0.3 :) My suggestion for 0.7-0.8 assumed the use of models like llama or mistral

menhguin commented 7 months ago

Hi @gante, just an update on this. I'm Minh, co-author of kalomaze's research paper introducing Min P. I've started running EleutherAI's eval harness on Min P, and early results seem quite strong. [image]

This is a test of Top_P = 0.9 vs Min_P = 0.9, for GSM8K_COT (8-shot), exact match on the EleutherAI eval harness.

Strangely enough, Min_P = 0.9 is optimal at high temps. Min_P = 0.1 at temp = 2 will get you 6%, which is better than Top_P's 0%, but nowhere near as impressive as Min_P = 0.9's 0.40 (40%).

We'll be conducting more tests to isolate confounding variables, optimise performance, run other evals, etc. But for now you can replicate this quite easily with the following settings (min_p = 0.9, top_p = 1 (disabled), and temp = 2 to 3). There might be some bugs/errors, but getting basically zero performance drop at temp = 3 seems quite significant.

# Note: drop the --wandb_args line if you don't want to log results to wandb
!python /content/lm-evaluation-harness/lm_eval \
    --model vllm \
    --model_args pretrained=mistralai/Mistral-7B-v0.1,dtype=auto \
    --batch_size "auto" \
    --tasks gsm8k_cot \
    --num_fewshot 8 \
    --wandb_args project=lm-eval-harness-integration \
    --log_samples \
    --output_path ./lm-eval-output/ \
    --gen_kwargs min_p=0.9,top_p=1,temperature=3,do_sample=True \
    --device cuda

Test colab used (used vLLM since min_p is not available in HF Transformers yet). Trying to figure out how to share wandb results directly.

Edit: here's a colab where you can replicate the Min_P tests at temperature = 9 https://colab.research.google.com/drive/1-gGcr7AyU9BdgkTxF8CTVQ9MpoqWvZhJ

So far these are just mathematical reasoning evals I can run quickly on colab (vs setting up human preference). Do let us know if you have any creativity-focused evals in mind, or if you'd still prefer setting up human preference evals resembling LMSys Chatbot Arena.

menhguin commented 7 months ago

Updated with better labelling and more optimised settings:

There's an argument to be made that no one would use 0.1 top_p anyway, but hmm, it's hard to figure out what settings should represent "realistic" user behaviour, since we don't ... have much info on user sampler preferences?

Will now test other evals, try to quantify creativity/diversity, and try human preference evals. Again, any suggestions welcome. [image]

gante commented 7 months ago

@menhguin thank you for the thorough investigation -- this is the sort of data I was looking for! There seems to be a parameterization range in which min_p excels, which means we should add it to πŸ€— transformers.

@menhguin @kalomaze would any of you be interested in opening a PR?


> since we don't ... have much info on user sampler preferences

I struggle with this limitation myself πŸ’” It would be cool to have an LMSYS-like table for generation parameterization!

menhguin commented 7 months ago

> @menhguin thank you for the thorough investigation -- this is the sort of data I was looking for! There seems to be a parameterization range in which min_p excels, which means we should add it to πŸ€— transformers.

@gante Just an update on evals done on VLLM:

I did a closer investigation of actual user preferences (based on the SillyTavern Discord). Users tend to prefer 0.05 and 0.1 min_p. Here's the updated graph; it's not as hilariously outmatched, but the difference is still very significant. [image]

Here is another independent eval done by the creator of AQLM, this time on EQ-Bench's creative writing eval (from https://t.me/senior_augur/76 ). [image]

We're almost done with evals (just beefing up the methodology to pre-empt the critiques of random peer reviewers), and hope to finalise the paper by the end of the month.

Again, we've proven quantitative improvements on the relevant benchmarks, and we have user reports of their preferences. Theoretically, the only thing we don't have is an LMSys-style user preference ranking, but that feels ... outside our scope.

@kalomaze says he'll do a PR when he wakes up

I feel like the most convincing argument is that min_p has no noticeable downside on more deterministic settings (lower temp, less selective sampling), and a noticeable upside on less deterministic settings (higher temp, more selective sampling). So it's arguably a straight improvement, unless someone discovers something really new.

Hellisotherpeople commented 7 months ago

This is a rare massive L from huggingface for putting Kalomaze and the related folks through what is useful but ultimately unnecessary work to prove what the community already knows - which is that Min P style sampling is awesome.

Huggingface is the "atlas" holding up the rest of the NLP ecosystem. They have a duty to support as many samplers as possible, even where the justification for implementing them is weaker.

Witnessing this has made me quite sad about the state of LLMs in general. Dynamic samplers (and especially Min_P) are straight up better than top_p/top_k, and even more sophisticated techniques like typicality sampling (which is dynamic) are widely regarded as not as good by actual users on r/localllama today.

Slowing down the proliferation of techniques like this does a whole lot to hurt the general perception of the quality of open-source LLMs, and incentivizes the community to push towards infinitely scaling parameter counts as the only "solution" to issues with LLM output.

Hellisotherpeople commented 7 months ago

Also @menhguin and @kalomaze, I'm extremely interested in helping out on the research paper you two are writing in any way that I can. I have access to significant amounts of compute resources and a rather large network of professionals who will be more easily persuaded about the merits of this technique than the folks in this issue thread.

amyeroberts commented 7 months ago

Thank you @menhguin for such detailed deep-dives into this and the detailed graphs!

@Hellisotherpeople It might seem counter-intuitive, but being selective about what we do and don't add to the library actually helps us move faster. We receive lots of PRs, issues and feature requests every day, and every new addition has a maintenance burden. It's therefore important that we're selective, to make sure additions have high impact and the time spent adding and maintaining them is valuable. Asking for results and/or evidence of community interest is pretty standard. Even if those are proven, sometimes there are other reasons it makes sense to add something later.

It might be frustrating to not see something you want immediately added to the library. The great thing about open-source is you can freely build and share your own code adding this feature!

gante commented 7 months ago

@Hellisotherpeople Another way of seeing it is as follows: we are selective about what we add here, and yet I took 2 weeks to get back to this message -- other requests, bugfixes, and general maintenance got in the way.

How much time would it take us, the maintainers, to have reasonable reply times if we were to accept most suggestions? How would you, a user, be able to find the tool that you need, in a vast sea of tools and flags? Curation through evidence ends up helping both sides, especially in the long run πŸ€—

gante commented 7 months ago

@menhguin @kalomaze let me know if you have the bandwidth to open a PR, otherwise I'd be happy to do so πŸ€—

menhguin commented 7 months ago

> @menhguin @kalomaze let me know if you have the bandwidth to open a PR, otherwise I'd be happy to do so πŸ€—

@gante Kalo is away rn, I'm gonna guess the answer is "yes"

Kalo's currently working on Quadratic Sampling (https://github.com/ggerganov/llama.cpp/pull/6445). I'm trying to finish up the actual Min P paper within the next 2 weeks, plus grinding some leetcode for AI Safety research programs and prepping my research skills for my Hume AI internship.

I'm new and not super familiar with the HF Transformers repo, so it might end up in limbo this month. Honestly, I don't mind trying to do it before my internship starts on the 20th, but I'm trying not to break your prod with weird edge-case bugs, so I don't mind you doing it, haha.

You can reference the code from other inference engine PRs here:

- https://github.com/ggerganov/llama.cpp/pull/3841
- https://github.com/vllm-project/vllm/pull/1642
- This is the exact implementation I'm referencing for the paper (see sample.hijack.py): https://github.com/oobabooga/text-generation-webui/pull/4449/files#diff-51532f26c3cdbf11129835e199b8f4f621d2ec5968ce71760d149e251ff51526

gante commented 7 months ago

@menhguin https://github.com/huggingface/transformers/pull/30639 -- I'm double-checking the quality of the outputs, but that should be it!

If you can, have a look at the PR πŸ˜‰

menhguin commented 7 months ago

@gante I've reviewed it. It seems fine at a glance, since you mainly referenced the original implementation and changed the relevant HF transformers files.

My main comments are about min_p values running in the opposite direction of comparable top_p values (a higher min_p is more selective, while a higher top_p is less selective) and how that might confuse users, but that's not a blocker. The functionality is there, so it seems OK.

The only part I might worry about a bit is logits_process (https://github.com/huggingface/transformers/pull/30639#discussion_r1589564551), due to the aforementioned values issue. I can attempt to figure out what that does tomorrow, if you haven't by then.

gante commented 6 months ago

Added πŸ‘

Here's a simple example running on main:

import torch
from transformers import pipeline, set_seed

set_seed(0)

chat = [
    {"role": "system", "content": "You are a sassy, wise-cracking robot as imagined by Hollywood circa 1986."},
    {"role": "user", "content": "Hey, can you tell me any fun things to do in New York?"}
]

# min_p=0.08 keeps only tokens whose probability is at least 8% of the top token's,
# which lets a relatively high temperature (1.5) add diversity without derailing the output.
pipe = pipeline("text-generation", "meta-llama/Meta-Llama-3-8B-Instruct", torch_dtype=torch.bfloat16, device_map="auto")
response = pipe(chat, max_new_tokens=512, do_sample=True, min_p=0.08, temperature=1.5)
print(response[0]['generated_text'][-1]['content'])