guidance-ai / guidance

A guidance language for controlling large language models.

Select generates incorrect output #546

Open d-01 opened 8 months ago

d-01 commented 8 months ago

The bug: select generates incorrect output, while unconstrained generation (gen) is correct.

Description: The problem seems to be connected to 8-bit quantization somehow, because I was unable to reproduce it in the other modes (fp16 non-quantized and 4-bit quantized).

Given this kind of prompt:

<BOS>

### Human:
Memorize and repeat only one item from this list:
001 = Ni
002 = aV
...
020 = wq
021 = _j
...

### Assistant:
021 =<EOS>

Free (unconstrained) generation gives the correct answer:

lm + prompt + gen(stop='\n', max_tokens=400)  # -> " _j"

But guided (constrained) generation gives the wrong answer in this case, which in theory should be impossible, since the most probable token must be the same in both cases:

lm + prompt + select([' Ni\n', ' aV\n', ' Dm\n', ...  # -> " wq\n"

Note that token healing should not be involved, since the prompt ends with a valid (non-partial) token {'▁=': 353}.
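
A quick way to sanity-check this claim is to inspect the raw next-token logits in transformers directly, with no guidance in the loop. The snippet below is only an illustrative sketch (it reuses the model and tokenizer objects from the reproduction script further down; rank_option_first_tokens is a hypothetical helper): it prints the overall argmax token and ranks each option by the logit of its first token.

import torch

def rank_option_first_tokens(model, tokenizer, prompt, options):
    # Next-token logits at the last position of the prompt (no guidance involved).
    ids = tokenizer(prompt, return_tensors='pt').input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]

    # Overall most probable next token (what unconstrained gen should emit first).
    top_id = int(logits.argmax())
    print('argmax token:', repr(tokenizer.decode([top_id])))

    # Coarse per-option score: the logit of each option's first token.
    # (Only a rough check -- options could in principle share a first token.)
    scores = {}
    for opt in options:
        first_id = tokenizer(opt, add_special_tokens=False).input_ids[0]
        scores[opt] = logits[first_id].item()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. rank_option_first_tokens(model, tokenizer, prompt, [' Ni\n', ' aV\n', ' wq\n', ' _j\n'])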

Probably related issue: https://github.com/guidance-ai/guidance/issues/472

To Reproduce

import re

from guidance import gen, select, models

import torch
from transformers import LlamaTokenizer, LlamaForCausalLM

MODEL_PATH = '/home/user/work/hf-models/lmsys/vicuna-13b-v1.3'

tokenizer = LlamaTokenizer.from_pretrained(MODEL_PATH, use_fast=False)

model = LlamaForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.float16,
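    # 8-bit quantization: the mode where the bug reproduces (fp16 and 4-bit appear unaffected)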
    load_in_8bit=True,
    load_in_4bit=False,
    device_map="auto",
)

lm = models.Transformers(model, tokenizer, echo=True, caching=False)
lm.echo = False

# Prompt, gt (expected output)
dataset = [('\n\n### Human:\nMemorize and repeat only one item from this list:\n001 = Ni\n002 = aV\n003 = Dm\n004 = IM\n005 = cD\n006 = ri\n007 = hs\n008 = E1\n009 = 1O\n010 = go\n011 = 9s\n012 = eV\n013 = 2F\n014 = KS\n015 = rI\n016 = c9\n017 = 8P\n018 = kP\n019 = Xh\n020 = wq\n021 = _j\n022 = 4M\n023 = WV\n024 = Kk\n025 = eU\n026 = HE\n027 = T9\n028 = Xk\n029 = _2\n030 = fO\n031 = uV\n032 = KH\n033 = bt\n034 = R_\n035 = 8K\n036 = uf\n037 = fj\n038 = 9w\n039 = fe\n040 = C0\n041 = Aa\n042 = Tn\n043 = 0J\n044 = nL\n045 = Lw\n046 = bq\n047 = NN\n048 = L0\n049 = m6\n050 = jK\n\n### Assistant:\n021 =', '_j'),
 ('\n\n### Human:\nMemorize and repeat only one item from this list:\n001 = gK\n002 = 0n\n003 = B7\n004 = vR\n005 = Sg\n006 = D1\n007 = eV\n008 = fW\n009 = ws\n010 = pI\n011 = IH\n012 = 2A\n013 = iD\n014 = 9j\n015 = oX\n016 = Vq\n017 = t6\n018 = e6\n019 = Kr\n020 = wG\n021 = W3\n022 = Kz\n023 = X4\n024 = En\n025 = _Z\n026 = rd\n027 = dL\n028 = M5\n029 = Vw\n030 = qn\n031 = IN\n032 = 40\n033 = _C\n034 = Zf\n035 = wN\n036 = Up\n037 = Tv\n038 = go\n039 = gf\n040 = OM\n041 = uA\n042 = RA\n043 = qm\n044 = LP\n045 = ZK\n046 = rK\n047 = WQ\n048 = Vc\n049 = xQ\n050 = qN\n\n### Assistant:\n012 =', '2A')]

for prompt, gt in dataset:
    options = re.findall(r'\d\d\d =( \S\S\n)', prompt)  # [' Ni\n', ' aV\n', ' Dm\n', ...]
    name = 'output'

    gen_free = lm + prompt + gen(name=name, stop='\n', max_tokens=400)
    gen_select = lm + prompt + select(options, name=name)

    print(prompt)
    print(f'Expected (gt): "{gt}"')
    print(f'Free generation (gen): "{gen_free[name].strip()}"')
    print(f'Guided generation (select): "{gen_select[name].strip()}"')
    print('-' * 80)

System info (please complete the following information): Environment:

slundberg commented 8 months ago

Thanks! Will look into this. Do you happen to have llama.cpp installed as well? If so, it would be nice to know whether the same issue happens there or whether it seems to be specific to HuggingFace.

Update: I tried this with Mistral 7B with no issues. Long story, but I have constraints that prevent me from running the llama-based vicuna version right now, so if you find a similar issue with a non-llama model, that would also be helpful.

d-01 commented 8 months ago

Thanks for your comment on this issue.

  1. I ran the same prompts through llama.cpp and the issue described earlier did not occur;
  2. GPT2 (small) - no issues either (a sketch of an equivalent check is at the end of this comment).

Code for llama.cpp experiment:

import guidance
from guidance import gen, select, substring, models

# https://huggingface.co/TheBloke/vicuna-13B-v1.5-GGUF/blob/main/vicuna-13b-v1.5.Q8_0.gguf
MODEL_PATH = '/home/user/work/hf-models/vicuna-13b-v1.5.Q8_0.gguf'

lm = models.LlamaCpp(MODEL_PATH, n_gpu_layers=-1, n_ctx=2048)

def complete(prompt, options=[]):
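    # Run the prompt either constrained (select over options) or unconstrained (gen), mirroring the HF reproduction above.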
    if options:
        options = [' ' + x + '\n' for x in options]
        generation = lm + prompt + select(options, name='output')
    else:
        generation = lm + prompt + gen(name='output', stop='\n', max_tokens=400, temperature=0)
    output = generation.get('output', '[ERROR] "output" variable is missing').strip()
    return output

...
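
For completeness, the GPT-2 check mentioned in point 2 can be set up the same way through the Transformers backend. This is only an illustrative sketch, not the exact code used (it assumes the standard gpt2 checkpoint from Hugging Face; names are hypothetical):

from guidance import gen, select, models

lm_gpt2 = models.Transformers('gpt2', echo=False)

def complete_gpt2(prompt, options=None):
    # Same comparison as above: select over the listed options vs. free generation.
    if options:
        options = [' ' + x + '\n' for x in options]
        generation = lm_gpt2 + prompt + select(options, name='output')
    else:
        generation = lm_gpt2 + prompt + gen(name='output', stop='\n', max_tokens=400, temperature=0)
    return generation['output'].strip()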