guidance-ai / guidance

A guidance language for controlling large language models.
MIT License

Select over long texts #472

Open CorentinvdBdO opened 10 months ago

CorentinvdBdO commented 10 months ago

Hey thanks for your work, guidance is amazing to work with now!

I had a question regarding select. As I understand it, it's simply a bias of the logprobs over the tokens. Also, LLMs don't predict blindly token by token, but instead build a "tree" of n tokens and use the most likely one. So does select work in the same manner? I.e., does it work reliably on long options, for example choosing between different sentences that may start the same way?

To be clearer, if we represent an option as a list of tokens, option A: [token_1, ..., token_n], the associated log-probabilities are p(token_1), p(token_1 & token_2), ..., p(token_1 & ... & token_n), written as p_1, ..., p_n.

I want to select solely on p_n, rather than selecting on p_1, then looking for token_2, etc. I believe this should allow selecting over long texts while removing the first-token bias.

Please tell me if I misunderstood something. Thanks again for your key contribution to open research!

kddubey commented 9 months ago

I believe this should allow for selecting over long texts while removing the first tokens bias.

This is definitely an important thing to think about. Luckily, most language models are good enough that simply mentioning the options in the prompt can overcome the bias. So instead of doing—

prompt = "Make the right choice:"
options = [
    " the first option",
    " the second option",
    " the final option",
]
gpt + prompt + select(options)

—do:

options_as_str = "\n".join(options)
prompt_with_options = f"""Here are your options:
{options_as_str}

{prompt}"""

print(prompt_with_options)
# Here are your options:
#  the first option
#  the second option
#  the final option
# 
# Make the right choice:

gpt + prompt_with_options + select(options)

Mentioning the options in the prompt causes the model to allocate a lot of probability to the first token of the correct option, and (after that token) it allocates a lot of probability to the second token of the correct option, and so on.

This prompting style costs more computation because there are more tokens in the context. But you'll almost always see an accuracy boost.

Another slight hack that works for heavily instruction-trained models: turn the task into a multiple choice question, i.e., point to each option (a long sentence) with a single letter. My notebook here demonstrates this strategy. Note that this prompt format may not work well if there are more than 5 options, because multiple choice question formats seen during training are usually limited to the letters from school exams: A, B, C, D, E.
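Here's a rough sketch of that letter-pointer idea (my own illustration, not the notebook's exact code; it assumes `gpt` is a guidance model as above, and the options are made-up placeholders):

import string

from guidance import select

# Hypothetical long options; in practice these could be full sentences.
options = [
    "The refund was processed on Tuesday.",
    "The refund is still pending review.",
    "The refund request was rejected.",
]

# Point to each long option with a single letter: A, B, C, ...
letters = list(string.ascii_uppercase[: len(options)])
choices = "\n".join(f"{letter}. {option}" for letter, option in zip(letters, options))

mcq_prompt = f"""Here are your options:
{choices}

Make the right choice. Answer with a single letter:"""

# Select over the single letters, then map back to the full option.
lm = gpt + mcq_prompt + " " + select(letters, name="letter")
chosen_option = options[letters.index(lm["letter"])]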

i.e. does it work reliably on long options, for example choose between different sentences that may start the same way.

My notebook here evaluated guidance.select on a task where there are 77 multi-token options. It performs only slightly worse than other methods, and is significantly faster. The options in this task aren't full sentences. But hopefully the experiment gives you some confidence in the guidance.select algorithm.

At a high level, the algorithm should work well even if some or all sentences start with the same prefix.

⚠️ Take the info below w/ a big grain of salt. I haven't read the actual algorithm yet. This is just how I think about it.

Say there are 3 options which start with prefix, and we've reached that node in the tree:

     prefix
    /  |   \
  s1   s2  s3

s{i} is the first token of the suffix of the ith option. From here, pick the suffix child token (s1, s2, or s3) with the highest conditional probability/logit (given the prompt + prefix). If this suffix child token has no children (or all of its children have at most 1 child), we're done; the option is unambiguous. Else, recurse on that child.

All this to say that the algorithm distinguishes between options with the same prefix by picking the option whose first suffix token is more likely. This heuristic is pretty in line with how greedy decoding in text generation works, so it should work pretty well.
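To make that mental model concrete, here's a toy sketch (again, not necessarily guidance's actual implementation; next_token_logprobs is a hypothetical function returning the model's next-token log-probabilities for a token context):

from dataclasses import dataclass, field

@dataclass
class TrieNode:
    # Maps a token id to the child node continuing some option.
    children: dict = field(default_factory=dict)

def greedy_trie_select(prompt_tokens: list, trie: TrieNode) -> list:
    """Greedily walk the token trie built from the options."""
    context = list(prompt_tokens)
    node = trie
    while node.children:
        if len(node.children) == 1:
            # Unambiguous continuation: no model call needed.
            token, node = next(iter(node.children.items()))
        else:
            # Options diverge here: pick the child token with the highest
            # conditional log-probability given prompt + matched prefix.
            logprobs = next_token_logprobs(context)  # hypothetical helper
            token = max(node.children, key=lambda t: logprobs[t])
            node = node.children[token]
        context.append(token)
    return context[len(prompt_tokens):]  # token ids of the chosen option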

I want to select solely on p_n

Interesting idea. I'm not sure this works well though, as p_n might be really high simply because token_n is really probable given tokens 1, 2, ..., n-1 (regardless of the prompt!). In case you'd like to explore a more holistic method, check out (my project) CAPPr. It takes the average log-probability of tokens 1 through n. It also has functionality to let you explore the argmax p_n idea. Feel free to open an issue there if you need help :-)
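For illustration, here's a minimal sketch of the average-log-probability scoring idea with plain Hugging Face transformers (my own toy version, not CAPPr's internals; "gpt2" is just a stand-in model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def avg_option_logprob(prompt: str, option: str) -> float:
    """Average log-probability of the option's tokens, given the prompt."""
    # Caveat: assumes tokenizing prompt + option leaves the prompt's own
    # tokens unchanged, which holds for typical prompt/option boundaries.
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # row i predicts token i+1
    option_ids = full_ids[0, prompt_len:]
    rows = range(prompt_len - 1, full_ids.shape[1] - 1)
    scores = [logprobs[r, t] for r, t in zip(rows, option_ids)]
    return float(torch.stack(scores).mean())

options = [" the first option", " the second option", " the final option"]
best = max(options, key=lambda o: avg_option_logprob("Make the right choice:", o))

Note that this costs one full forward pass per option, which matters when there are many options.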

CorentinvdBdO commented 9 months ago

Yeah, I use the same technique of passing the possible solutions first. I'd need to test what happens when the number of options gets too large, and whether it's better to use a number, a letter, or nothing at all. But I need a pre-sort anyway, because I deal with thousands of options.

CorentinvdBdO commented 9 months ago

And regarding p_n: it's not possible using logits, as you'd need to compute way too many generations, making the process way too long. I think one would need to put their hands directly in the model, but then we lose the relative model-independence of guidance...

CorentinvdBdO commented 9 months ago

An example where guidance select breaks down:

### Instruction:
You are assisting the customer service team.
Your task is to extract the following information from the customer's input:
- The brand of the store.
You must format your response in valid JSON. If any information is missing, write "N/A" instead.

### Input:
I'm going to DOLCE&GABBANA

### Response:
{
        BRAND: "DOLPHIN

When selecting between ["D&G", "Dolce & Gabbana", "DOLPHIN", "N/A"] on "TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ".

The model doesn't matter; this type of behaviour is seen across models from 7B to 13B, quantized or not.

The subtlety here is that the model tries to generate in all caps and ends up with DOLPHIN, thanks to the very high probability given to the first tokens.

The generation can be correct with small changes to the capitalization and/or the spacing around "&". This is not optimal, unless you only want to work in lowercase. These changes should be semantically meaningless; they don't change what the string means.

The generation works if you give the selection in the prompt (like "select from these brands: "). This is again not optimal, as some of us work with hundreds or thousands of options.
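For reference, here's roughly how I'm running it (a sketch; the exact loading arguments for the GPTQ checkpoint are an assumption and may differ on your setup):

from guidance import models, select

# Assumption: the GPTQ checkpoint loads through guidance's Transformers wrapper.
lm = models.Transformers("TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ")

prompt = """### Instruction:
You are assisting the customer service team.
Your task is to extract the following information from the customer's input:
- The brand of the store.
You must format your response in valid JSON. If any information is missing, write "N/A" instead.

### Input:
I'm going to DOLCE&GABBANA

### Response:
{
        BRAND: """

brands = ["D&G", "Dolce & Gabbana", "DOLPHIN", "N/A"]
result = lm + prompt + '"' + select(brands, name="brand") + '"'
print(result["brand"])  # comes out as "DOLPHIN" rather than "Dolce & Gabbana"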

CarloNicolini commented 6 months ago

In my experiments I've been playing with job-type selection. Currently, apart from the very slow behaviour when providing the entire job list as a concatenated string to choose from (I'm using Mixtral 8x7B, 5-bit, with 4 A100 GPUs), I see that the results aren't good. It seems that the selection focuses on the very last options. Is it a problem with context size?