guidance-ai / guidance

A guidance language for controlling large language models.

Temperature effect on Select #794

Open luciolcv opened 5 months ago

luciolcv commented 5 months ago

Hi there,

I have some doubts about the process behind the select method.

  1. Is there any detailed explanation of what happens under the hood when using select and gen? Specifically, can select(['joke', 'poem']) perform actions equivalent to those performed by gen(regex='(joke|poem)', temperature=_)? If not, how do they differ? (See the sketch below for the two calls I have in mind.)
  2. Does temperature have any impact on select? If so, what is the default temperature used by select?
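
Concretely, these are the two calls I'm comparing (a minimal sketch; the "gpt2" model and the prompt are just placeholders):

from guidance import models, gen, select

lm = models.Transformers("gpt2", echo=False)  # placeholder model, just for illustration
lm += "Tell me a "

# Constrained selection vs. an equivalent-looking regex constraint
lm_select = lm + select(["joke", "poem"], name="kind")
lm_regex = lm + gen(name="kind", regex=r"(joke|poem)")

print(lm_select["kind"], lm_regex["kind"])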

Thanks!

parkervg commented 4 months ago

(First of all, I'm not sure of the specific method guidance uses for these functions, but from what I know about constrained decoding, I think the example below should still be valid.)

I recently came across this article from @AidanCooper that I really like: https://www.aidancooper.co.uk/constrained-decoding/

He's got a great example in there (using another constrained decoding library), which I've abbreviated and pasted below.

import sglang as sgl

@sgl.function
def us_president_choices(s):
    s += sgl.user("Name a US president.")
    s += sgl.assistant(
        "An example of a US president is " +
        sgl.gen("president", choices=["Donald Duck", "Millard Fillmore"])
    )

@sgl.function
def us_president_regex(s):
    s += sgl.user("Name a US president.")
    s += sgl.assistant(
        "An example of a US president is " +
        sgl.gen("president", regex=r"(Donald Duck|Millard Fillmore)")
    )

state_choices = us_president_choices.run()
state_regex = us_president_regex.run()
print(state_choices["president"])  # >>> Millard Fillmore
print(state_regex["president"])  # >>> Donald Duck

Takeaway:

The greedy algorithm used by the regex implementation is short-sighted, and cannot resist choosing the "Donald" option, despite ultimately landing on an incorrect answer.
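
To make that concrete, here's a toy illustration of the failure mode (the numbers are made up, not real model probabilities):

# Greedy constrained decoding commits to one token at a time.
# Suppose the model assigns these (made-up) probabilities to the first token:
first_token_probs = {"Donald": 0.6, "Millard": 0.4}

# ...and these (made-up) probabilities to each option scored in full:
full_option_probs = {"Donald Duck": 0.1, "Millard Fillmore": 0.9}

# Token-by-token masking picks the most likely first token...
print(max(first_token_probs, key=first_token_probs.get))  # Donald -> forced into "Donald Duck"

# ...even though scoring each option in its entirety prefers the other choice.
print(max(full_option_probs, key=full_option_probs.get))  # Millard Fillmore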

I haven't studied the guidance source code in great detail, so if something else is happening behind the scenes that would make the above example invalid, I'd love to be corrected here!

wjn0 commented 4 months ago

I second the first question -- it would be useful to understand, because I very often see counterintuitive behaviour between unconstrained generation through raw transformers and constrained generation through guidance. I need to debug further, but it seems to me there are a few possibilities: top_p/top_k filtering; temperature (although I think this shouldn't matter, because guidance is not random, i.e. it always decodes greedily); and (perhaps most insidiously) tokenization issues that are prompt-/use case-specific.

To answer your question @parkervg, a quick test would suggest that both select and gen (with regex) in guidance are greedy:

from transformers import AutoModelForCausalLM, AutoTokenizer, PretrainedConfig

import guidance

model_name = "unsloth/llama-3-8b-bnb-4bit"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

# lm += "An example of a US president is the current president, "
lm += "The US president who freed the slaves with his Emancipation Proclamation was "
# choices = ["Abraham Lincoln", "Donald Duck"]
choices = ["Abraham Ozler", "A. Lincoln"]
lm += guidance.select(choices)
# lm += guidance.gen(regex=rf"({choices[0]}|{choices[1]})")
print(lm)
# Output: The US president who freed the slaves with his Emancipation Proclamation was Abraham Ozler
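
For comparison, the unconstrained greedy completion from raw transformers can be checked by continuing the same snippet (a quick sketch that just reuses the model and tokenizer loaded above; not something guidance does internally):

prompt = "The US president who freed the slaves with his Emancipation Proclamation was "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)  # greedy decoding
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))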

Funny enough, for my use case, I actually prefer this behaviour, but I'm having trouble reliably getting it with my model/prompt, so some insight as to the internals of gen, select, and any differences between raw transformers would still be useful.

wjn0 commented 4 months ago

> Funny enough, for my use case, I actually prefer this behaviour, but I'm having trouble reliably getting it with my model/prompt, so some insight as to the internals of gen, select, and any differences between raw transformers would still be useful.

I've made a bit of progress on understanding the root of my issue, and I think it's something to do with tokenization rather than top_p or top_k - see #876 if you're curious. (I don't want to hijack this issue, but it relates to question (1) in the OP.)

Harsha-Nori commented 4 months ago

Hey all, thanks for the great discussion here! I'll first answer the second question, then bounce back to the first :).

In terms of the guidance API, the challenge with sampling parameters in general -- temperature, top_p, top_k, etc. -- is that we really need them to be part of our grammar tree so that we can selectively apply them across different ranges of generations in a model without needing to make fresh API calls (i.e. govern them as part of a stateless interaction).

The current best way to do this is to wrap ranges of generations in the with_temperature function. The tricky business here is that we want to let users override it with specific temperature values set lower in the stack, so that you can do things like a global temperature across a giant function, but surgically update a single gen inside of that to use a different one. The idea is that sampling parameter settings that are nested more deeply would take precedence.
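
As a sketch of that intended nesting behavior (illustrative only -- based on the description above rather than a guarantee of current semantics; the model and prompts are placeholders):

from guidance import models, gen, select
from guidance.library import with_temperature

lm = models.Transformers("gpt2", echo=False)  # placeholder model
lm += "Tell me something fun. "

# The outer wrapper sets a high temperature for the whole span; the inner
# with_temperature is nested more deeply, so under the precedence idea above
# it would apply to just the select.
lm += with_temperature(
    gen("setup", stop=".", max_tokens=30)
    + " That was a "
    + with_temperature(select(["joke", "poem"], name="kind"), temperature=0.0),
    temperature=1.5,
)
print(lm["kind"])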

Some ideas here in terms of API:

1) Setting them globally on the entire LM object (like we used to do in the model init)
2) Using a with Sampler(temperature=...): context manager for a range of code inside a function. This makes it more convenient than a single function wrapper, but adds a layer of indentation.
3) with_sampling_params(grammar) wrapper functions, like a generalized with_temperature today
4) Sampling parameter settings on each atomic library function (as kwargs to gen, select, json, etc.)

Would appreciate thoughts from this group on what API feels ergonomic here!

Here's a code example of setting temperature on select in guidance today:

import guidance
from guidance import models, gen, select
from guidance.library import with_temperature

lm = models.Transformers("gpt2", device_map="mps", echo=False)
lm += "1+1=2. 2+2=4. 3+3="

# picks '6' every time
for i in range(10):    
    lm_temp0 = lm + select(['6', '7'], name="num")
    print(lm_temp0['num'])

# switches between options
for i in range(10):    
    lm_temp2 = lm + with_temperature(select(['6', '7'], name="num"), temperature=2)
    print(lm_temp2['num'])

Harsha-Nori commented 4 months ago

On how select works today -- we do enforce constraints greedily (i.e. one token at a time), which I totally appreciate has its challenges. This is how the Donald Duck/A. Lincoln style artifacts can creep in.

One thing we need to stress better is that constraints are more like guardrails and don't fundamentally remove the requirement to prompt properly. If you change your prompts to tell the model what its choices are ahead of time -- even abstractly -- you'll generally get much better performance and avoid these artifacts of greedy decoding for most capable models. This is a broadly applicable principle (e.g. you'll get way better performance out of your JSON Schemas if you simultaneously pass the schema to the model as part of the prompt AND enforce constraints on generation, vs. just doing one or the other). Proper prompting and constrained decoding really are complementary.
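
For example, a minimal sketch of "prompt with the choices, then also constrain" (the model, prompt wording, and options here are just placeholders):

from guidance import models, select

lm = models.Transformers("gpt2", echo=False)  # placeholder model

options = ["Abraham Lincoln", "Ulysses S. Grant"]
# Tell the model its choices up front *and* constrain the generation to them.
lm += f"Answer with one of: {', '.join(options)}. "
lm += "The president who issued the Emancipation Proclamation was "
lm += select(options, name="president")
print(lm["president"])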

Closed model providers who run constrained decoding in production often do silent prompt modifications/extensions to "prime" the model ahead of constraining. In guidance, we don't want to do silent modifications to user prompts -- part of the beauty is that you can always see and inspect exactly what's going to the model, and when. I'm trying to think of the best way to encourage this "best practice" -- perhaps more educational material, or a kwarg on some of our syntactic sugars like json() which appends the schema to your prompt (which you can always disable)?

On the scientific side, we think there are non-greedy ways of addressing this too, that might come at the cost of more inference time compute. Greedy constraint enforcement is nice because it essentially doesn't slow the model down, but if you're willing to add more latency/compute, we can likely do much better. I think we'd want to do these tricks as "opt-in" though, because they're almost all linear in terms of the size/length of the select options.

Harsha-Nori commented 4 months ago

Have to run for a bit, but happy to answer more questions in the AM.

parkervg commented 4 months ago

Thanks for the thorough explanations @Harsha-Nori ! Really interesting stuff.

I like the emphasis on constraints as guardrails, not a silver bullet that lets you bypass prompt refinement. As for the temperature API point -- in the spirit of being as explicit as possible, I'm in favor of pattern 4 (sampling parameter settings on each atomic library function). I could imagine the following context manager pattern getting unwieldy:

with assistant():
    with Sampler(temperature=temperature):
        ....

Being an argument of the atomic function, this would:

1) Remove the need for a distinct Sampler import
2) Allow for more explicit parameter-passing patterns
3) Give the new developer hints for the recommended interaction pattern in the form of named arguments from their auto-completing IDE, rather than requiring a search through the entire from guidance import ... module to find the function/class they need

If I wanted to effectively set a more 'global' parameter, I might do something like below:

from functools import partial
zero_temp_gen = partial(gen, temperature=0.0)
lm += zero_temp_gen("Say something factual")

wjn0 commented 4 months ago

Thanks for that insight! I agree, I think greedy decoding is a very sensible and intuitive default (although sadly it does not seem to quite hold for all models as of now; see #876). Beam search or similar could be a nice-to-have for certain applications, but I don't really need it for any of my current use cases from what I can tell. Similarly, I like the simplicity of the low-level guidance interface -- I wouldn't want it to default to modifying my prompt in potentially opaque ways.

For temperature settings, I think the per-atomic function approach is best, too. A fifth option would be to allow configuration-until-configuration-updated, i.e., setting temperature/other params on the fly in a separate atomic function:

lm += set_generation_opts(temperature=0.2)
# temperature is now 0.2 until it's changed again

Even as it stands now, a section in the docs on how temperature, top_*, etc. are used would be great for folks coming from raw transformers. (I spent quite a bit of time wondering whether #876 could be caused by these parameters.) Either way, very grateful to see you guys thinking about user experience to the extent that you are! Cheers!

AidanCooper commented 4 months ago

Thanks for linking me to this interesting thread, @parkervg!

> If you change your prompts to tell the model what its choices are ahead of time -- even abstractly -- you'll generally get much better performance and avoid these artifacts of greedy decoding for most capable models.

This is certainly true. However, I have found that for complex tasks with less-capable models, greedy select behaviour can produce poor results, even when providing full context on the constraints.

I think SGLang's approach of evaluating the full set of choices in their entirety makes a lot of sense here (although I have observed some strange behaviour with their implementation — possibly some sort of token healing artefact). The latency penalty should be pretty minimal for realistic sets of select options. For select specifically, only supporting greedy token selection is quite prohibitive for smaller models that aren't so great at instruction following, unless the options are very predictable (task-wise and first-token-wise).
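
For reference, here's a rough sketch of that "score each option in its entirety" idea using raw transformers (not guidance's or SGLang's actual implementation; gpt2 is a placeholder, and the sketch ignores length normalization and boundary-tokenization effects):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "An example of a US president is"
choices = ["Donald Duck", "Millard Fillmore"]

def continuation_logprob(prompt, continuation):
    # Total log-probability of the continuation tokens, conditioned on the prompt.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # log P(token at `pos` | all tokens before it)
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# Score every option in full, then pick the highest-scoring one.
best = max(choices, key=lambda c: continuation_logprob(prompt, " " + c))
print(best)

Something along those lines pays one extra forward pass per option, which, as noted above, should be tolerable for realistic sets of select options.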

API idea (4) would be a great way to support other sampling strategies beyond the greedy default for individual select constructs!