guidance-ai / guidance

A guidance language for controlling large language models.

Select generates incorrect output #546

Open d-01 opened 8 months ago

d-01 commented 8 months ago

The bug: select generates incorrect output, while unconstrained generation (gen) is correct.

Description: The problem seems to be connected to 8-bit quantization somehow, because I was unable to reproduce it in the other modes (fp16 non-quantized and 4-bit quantized).

Given this kind of prompt:

<BOS>

### Human:
Memorize and repeat only one item from this list:
001 = Ni
002 = aV
...
020 = wq
021 = _j
...

### Assistant:
021 =<EOS>

Free (unconstrained) generation gives the correct answer:

lm + prompt + gen(stop='\n', max_tokens=400)  # -> " _j"

But guided (constrained) generation gives the wrong answer in this case, which in theory should be impossible, since the most probable token must be the same in both cases:

lm + prompt + select([' Ni\n', ' aV\n', ' Dm\n', ...  # -> " wq\n"

Note that token healing should not be involved, since the prompt ends with a valid (non-partial) token {'▁=': 353}.
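
A quick way to sanity-check this claim is to inspect the raw next-token logits in transformers directly, with no guidance in the loop. The snippet below is only an illustrative sketch (it reuses the model and tokenizer objects from the reproduction script further down; rank_option_first_tokens is a hypothetical helper): it prints the overall argmax token and ranks each option by the logit of its first token.

import torch

def rank_option_first_tokens(model, tokenizer, prompt, options):
    # Next-token logits at the last position of the prompt (no guidance involved).
    ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]

    # Overall most probable next token (what unconstrained gen should emit first).
    top_id = int(logits.argmax())
    print('argmax token:', repr(tokenizer.decode([top_id])))

    # Coarse per-option score: the logit of each option's first token.
    # (Only a rough check -- options could in principle share a first token.)
    scores = {}
    for opt in options:
        first_id = tokenizer(opt, add_special_tokens=False).input_ids[0]
        scores[opt] = logits[first_id].item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. rank_option_first_tokens(model, tokenizer, prompt, [' Ni\n', ' aV\n', ' wq\n', ' _j\n'])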

Probably related issue: https://github.com/guidance-ai/guidance/issues/472

To Reproduce

import re

from guidance import gen, select, models

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

MODEL_PATH = '/home/user/work/hf-models/lmsys/vicuna-13b-v1.3'

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH, use_fast=False)

model = LlamaForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
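    # 8-bit quantization: the mode where the bug reproduces (fp16 and 4-bit appear unaffected)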
    load_in_8bit=True,
    load_in_4bit=False,
    device_map="auto",
)

lm = models.Transformers(model, tokenizer, echo=True, caching=False)
lm.echo = False

# Prompt, gt (expected output)
dataset = [('\n\n### Human:\nMemorize and repeat only one item from this list:\n001 = Ni\n002 = aV\n003 = Dm\n004 = IM\n005 = cD\n006 = ri\n007 = hs\n008 = E1\n009 = 1O\n010 = go\n011 = 9s\n012 = eV\n013 = 2F\n014 = KS\n015 = rI\n016 = c9\n017 = 8P\n018 = kP\n019 = Xh\n020 = wq\n021 = _j\n022 = 4M\n023 = WV\n024 = Kk\n025 = eU\n026 = HE\n027 = T9\n028 = Xk\n029 = _2\n030 = fO\n031 = uV\n032 = KH\n033 = bt\n034 = R_\n035 = 8K\n036 = uf\n037 = fj\n038 = 9w\n039 = fe\n040 = C0\n041 = Aa\n042 = Tn\n043 = 0J\n044 = nL\n045 = Lw\n046 = bq\n047 = NN\n048 = L0\n049 = m6\n050 = jK\n\n### Assistant:\n021 =', '_j'),
 ('\n\n### Human:\nMemorize and repeat only one item from this list:\n001 = gK\n002 = 0n\n003 = B7\n004 = vR\n005 = Sg\n006 = D1\n007 = eV\n008 = fW\n009 = ws\n010 = pI\n011 = IH\n012 = 2A\n013 = iD\n014 = 9j\n015 = oX\n016 = Vq\n017 = t6\n018 = e6\n019 = Kr\n020 = wG\n021 = W3\n022 = Kz\n023 = X4\n024 = En\n025 = _Z\n026 = rd\n027 = dL\n028 = M5\n029 = Vw\n030 = qn\n031 = IN\n032 = 40\n033 = _C\n034 = Zf\n035 = wN\n036 = Up\n037 = Tv\n038 = go\n039 = gf\n040 = OM\n041 = uA\n042 = RA\n043 = qm\n044 = LP\n045 = ZK\n046 = rK\n047 = WQ\n048 = Vc\n049 = xQ\n050 = qN\n\n### Assistant:\n012 =', '2A')]

for prompt, gt in dataset:
    options = re.findall(r'\d\d\d =( \S\S\n)', prompt)  # [' Ni\n', ' aV\n', ' Dm\n', ...]
    name = 'output'

    gen_free = lm + prompt + gen(name=name, stop='\n', max_tokens=400)
    gen_select = lm + prompt + select(options, name=name)

    print(prompt)
    print(f'Expected (gt): "{gt}"')
    print(f'Free generation (gen): "{gen_free[name].strip()}"')
    print(f'Guided generation (select): "{gen_select[name].strip()}"')
    print('-' * 80)

System info (please complete the following information): Environment:

slundberg commented 8 months ago

Thanks! Will look into this. Do you happen to have llama.cpp installed as well? If so, it would be nice to know whether the same issue happens there or whether it seems to be specific to HuggingFace.

Update: I tried this with Mistral 7B with no issues. Long story, but I have constraints that prevent me from running the llama-based vicuna version right now, so if you find a similar issue with a non-llama model, that would also be helpful.

d-01 commented 8 months ago

Thanks for your comment on this issue.

  1. I ran the same prompts through llama.cpp and the issue described earlier did not occur;
  2. GPT2 (small) - no issues either (a sketch of an equivalent check is at the end of this comment).

Code for llama.cpp experiment:

import guidance
from guidance import gen, select, substring, models

# https://huggingface.co/TheBloke/vicuna-13B-v1.5-GGUF/blob/main/vicuna-13b-v1.5.Q8_0.gguf
MODEL_PATH = '/home/user/work/hf-models/vicuna-13b-v1.5.Q8_0.gguf'

lm = models.LlamaCpp(MODEL_PATH, n_gpu_layers=-1, n_ctx=2048)

def complete(prompt, options=[]):
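    # Run the prompt either constrained (select over options) or unconstrained (gen), mirroring the HF reproduction above.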
    if options:
        options = [' ' + x + '\n' for x in options]
        generation = lm + prompt + select(options, name='output')
    else:
        generation = lm + prompt + gen(name='output', stop='\n', max_tokens=400, temperature=0)
    output = generation.get('output', '[ERROR] "output" variable is missing').strip()
    return output

...
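
For completeness, the GPT-2 check mentioned in point 2 can be set up the same way through the Transformers backend. This is only an illustrative sketch, not the exact code used (it assumes the standard gpt2 checkpoint from Hugging Face; names are hypothetical):

from guidance import gen, select, models

lm_gpt2 = models.Transformers('gpt2', echo=False)

def complete_gpt2(prompt, options=None):
    # Same comparison as above: select over the listed options vs. free generation.
    if options:
        options = [' ' + x + '\n' for x in options]
        generation = lm_gpt2 + prompt + select(options, name='output')
    else:
        generation = lm_gpt2 + prompt + gen(name='output', stop='\n', max_tokens=400, temperature=0)
    return generation['output'].strip()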