guidance-ai / guidance

A guidance language for controlling large language models.
MIT License

`select` produces different results than `gen`, even though the maximum likelihood answer should be the same (tokenization/token healing issue?) #876

Open wjn0 opened 4 months ago

wjn0 commented 4 months ago

The bug

I have a minimal reproducible example where I would expect `select` and `gen` to produce similar results, but they don't. My experimentation suggests a tokenization or token-healing issue, but I'm not sure. If the behaviour is expected, it would be useful to have some documentation explaining why.

To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer

import guidance

print("guidance version: ", guidance.__version__)

model_name = "unsloth/llama-3-70b-Instruct-bnb-4bit"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

messages = [
    {"role": "system",
     "content": "You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first."},
    {"role": "user",
     "content": "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

lm += prompt
lm += "{\n  "
# lm += guidance.select(["\"author\"", "\"title\""]) + guidance.gen(max_tokens=10)
lm += guidance.gen(max_tokens=10)
print(lm)

With `select`, the first generated property is "title" ("wrong" in a certain sense), while with unconstrained `gen` it is "author" ("correct" in a certain sense).

(1) gen output:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first.<|eot_id|><|start_header_id|>user<|end_header_id|>

The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{
  "author": "F. Scott Fitzgerald",
```

(2) select output:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first.<|eot_id|><|start_header_id|>user<|end_header_id|>

The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{
  "title": "The Great Gatsby",
  "author
```

System info (please complete the following information):

riedgar-ms commented 4 months ago

That's interesting. Is this happening with other models too?

wjn0 commented 4 months ago

I'm really only familiar with the Llama family of models, but Phi 2 does not seem to display the same behaviour, at least with this example.

However, it does do something else that is odd: it generates extraneous whitespace when using `gen` after `select`. (I would expect the generations to match token-for-token, given that the greedy decoding in `select` produces the same first few tokens as `gen`.)

Code

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, PretrainedConfig

import guidance

print("guidance version: ", guidance.__version__)

# model_name = "unsloth/llama-3-70b-Instruct-bnb-4bit"
model_name = "microsoft/phi-2"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

if "llama" in model_name:
    messages = [
        {"role": "system",
         "content": "You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first."},
        {"role": "user",
         "content": "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925."},
    ]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
elif "phi" in model_name:
    prompt = "Instruct: Generate a JSON structure representing a book with the following properties: title, author, and publication date. The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925. Generate the author property first."
    prompt += "\nOutput: ```json\n"

lm += prompt
# print(lm)
lm += "{\n "
# lm += guidance.select(["\"author\"", "\"title\""]) + guidance.gen(max_tokens=100)
lm += guidance.gen(max_tokens=100)
print(lm)
```

Output with select used for the first property name + gen for the rest

Instruct: Generate a JSON structure representing a book with the following properties: title, author, and publication date. The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925. Generate the author property first.
Output: ```json
{
  "title": "The Great Gatsby",

  "author": "F. Scott Fitzgerald",

  "publication_date": "1925"
}

Output with gen for the whole thing

Instruct: Generate a JSON structure representing a book with the following properties: title, author, and publication date. The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925. Generate the author property first.
Output: ```json
{
  "title": "The Great Gatsby",
  "author": "F. Scott Fitzgerald",
  "publication_date": "1925"
}
wjn0 commented 4 months ago

If there's a model family you'd like me to try, feel free to let me know and I'll poke at it some more.

wjn0 commented 4 months ago

The select vs gen discrepancy happens with:

The whitespace issue happens with:

I've also tried use_fast=False on the tokenizer just in case and that doesn't seem to do anything.

wjn0 commented 4 months ago

And here are my experiments that suggest that this relates to the need to heal tokens around the "boundaries" between prompting and generation:

>>> tokenizer.encode(prompt + "{\n  ")
[128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2363, 2038, 14143, 13, 20400, 264, 4823, 6070, 14393, 264, 2363, 449, 279, 2768, 6012, 25, 2316, 11, 3229, 11, 323, 17009, 2457, 13, 20400, 279, 3229, 3424, 1176, 13, 128009, 128006, 882, 128007, 271, 791, 2363, 374, 2663, 364, 791, 8681, 480, 36614, 518, 5439, 555, 435, 13, 10016, 62314, 11, 323, 574, 4756, 304, 220, 5926, 20, 13, 128009, 128006, 78191, 128007, 271, 517, 256]
>>> tokenizer.encode(prompt + "{\n  \"author\"")
[128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2363, 2038, 14143, 13, 20400, 264, 4823, 6070, 14393, 264, 2363, 449, 279, 2768, 6012, 25, 2316, 11, 3229, 11, 323, 17009, 2457, 13, 20400, 279, 3229, 3424, 1176, 13, 128009, 128006, 882, 128007, 271, 791, 2363, 374, 2663, 364, 791, 8681, 480, 36614, 518, 5439, 555, 435, 13, 10016, 62314, 11, 323, 574, 4756, 304, 220, 5926, 20, 13, 128009, 128006, 78191, 128007, 271, 517, 220, 330, 3170, 1]
>>> tokenizer.encode(prompt + "{\n  \"title\"")
[128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2363, 2038, 14143, 13, 20400, 264, 4823, 6070, 14393, 264, 2363, 449, 279, 2768, 6012, 25, 2316, 11, 3229, 11, 323, 17009, 2457, 13, 20400, 279, 3229, 3424, 1176, 13, 128009, 128006, 882, 128007, 271, 791, 2363, 374, 2663, 364, 791, 8681, 480, 36614, 518, 5439, 555, 435, 13, 10016, 62314, 11, 323, 574, 4756, 304, 220, 5926, 20, 13, 128009, 128006, 78191, 128007, 271, 517, 220, 330, 2150, 1]
wjn0 commented 4 months ago

LlamaCpp produces the same results as transformers.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, PretrainedConfig

import guidance
from llama_cpp import Llama

print("guidance version: ", guidance.__version__)

model_name = "meta-llama/meta-llama-3-8B-Instruct"  # only for formatting the chat
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)

model_name = "bartowski/Meta-Llama-3-70B-Instruct-GGUF"
llm = Llama.from_pretrained(model_name, filename="Meta-Llama-3-70B-Instruct-Q5_K_M.gguf", n_gpu_layers=-1)
lm = guidance.models.LlamaCpp(llm)

if "llama" in model_name or "vicuna" in model_name or "aya" in model_name or "Llama" in model_name:
    messages = [
        {"role": "system",
         "content": "You are a book information generator. Generate a JSON structure representing a book with the following properties: title, author, and publication date. Generate the author property first."},
        {"role": "user",
         "content": "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925."},
    ]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
elif "phi" in model_name:
    prompt = "Instruct: Generate a JSON structure representing a book with the following properties: title, author, and publication date. The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925. Generate the author property first."
    prompt += "\nOutput: ```json\n"

lm += prompt
lm += "{\n " + guidance.gen(max_tokens=100, name="generation")
# lm += "{\n " + guidance.select(["\"title\"", "\"author\""], name="generation") + guidance.gen(max_tokens=100, name="generation2")
print(lm)
```
riedgar-ms commented 3 months ago

Thanks for the extra information. This is odd. Given that the only difference is changing from `gen()` to `select()`, I would expect any token healing to work the same way.

wjn0 commented 3 months ago

Yes, I think you're right! Here's an example where it still happens (this time, a discrepancy between two `gen` calls with overlapping prompts), even without token healing being a factor (see the tokenization two replies up). When a leading quote is not provided, a property named `author` is generated; when it is provided, a property named `_author` is generated.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

import guidance
from test_tokenization import print_tokens

print("guidance version: ", guidance.__version__)

model_name = "unsloth/llama-3-70b-Instruct-bnb-4bit"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
lm = guidance.models.Transformers(model=model, tokenizer=tokenizer)

messages = [
    {"role": "system",
     "content": "You are a book information generator. Generate a JSON structure representing a book with the following properties: `author`, `title`, and `publication_date`. Generate the author property first. All values should be strings."},
    {"role": "user",
     "content": "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

total_prompt = prompt + "{\n "
# lm += total_prompt
lm += total_prompt + '"'
lm += guidance.gen(max_tokens=10)
print(lm)
```

A quick note on severity/impact here -- I'm seeing this bug more often than not in my current project. Various workarounds (which amount to a kind of manual token healing as in the above example) are a bit finicky, so for my use case, this precludes me from using guidance. I'm transitioning to another todo item of mine for the next little while, but when I return I'll likely either (a) attempt bughunting in guidance, or (b) roll my own. In either case, I might learn something I can report back.

I know you guys are super busy, but if you or the rest of the guidance team have any tips for how I might approach (a) -- mainly where to start in grokking the internals of the library -- they'd be appreciated :) Rolling my own would feel a bit silly. I'd ideally like to step through wherever guidance does tokenization with a debugger, if you can point me in that direction.

Cheers and thanks again!

riedgar-ms commented 3 months ago

If you're really after JSON generation, then may I suggest our recently released JSON support: https://guidance.readthedocs.io/en/latest/generated/guidance.json.html#guidance.json
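For reference, here is a minimal sketch of what that could look like for the book example in this thread. The parameter names follow the linked docs, so treat them as an assumption rather than a call verified against this guidance version:

```python
# Sketch (unverified against this guidance version): constrain generation to a
# JSON schema with the built-in JSON support instead of select/gen templating.
import guidance

book_schema = {
    "type": "object",
    "properties": {
        "author": {"type": "string"},
        "title": {"type": "string"},
        "publication_date": {"type": "string"},
    },
    "required": ["author", "title", "publication_date"],
}

# `lm` is a guidance model as in the reproductions above.
lm += guidance.json(name="book", schema=book_schema)
print(lm["book"])  # the captured JSON text
```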

Thank you for all the analysis you have done so far!

hudson-ai commented 3 months ago

@wjn0 I want to take a closer look at what's going on behind the scenes RE: token healing, but in the meantime I second @riedgar-ms's suggestion to look at the built-in JSON support. Would you let us know if it behaves as you expect in this situation? Also +1 to the thank you!

wjn0 commented 3 months ago

Thanks folks -- yes, absolutely, I love the JSON option when it's available to me. I haven't checked it for this minimal reproducible example (because it's just that for me -- a reduction of some stuff I was seeing in the real world), but I suspect it's OK based on earlier experimentation.

@hudson-ai @riedgar-ms In practice, I'm working with massive JSON schemas (several megabytes plain text w/ nested schemas) that seem to be far too large to compile to regexes (even w/ nested schemas removed + some preprocessing), so the hybrid template/structured generation approach has been ideal for me thus far, hence the example. Cheers again 👍🏻

riedgar-ms commented 3 months ago

> Thanks folks -- yes, absolutely, I love the JSON option when it's available to me. I haven't checked it for this minimal reproducible example (because it's just that for me -- a reduction of some stuff I was seeing in the real world), but I suspect it's OK based on earlier experimentation.
>
> @hudson-ai @riedgar-ms In practice, I'm working with massive JSON schemas (several megabytes plain text w/ nested schemas) that seem to be far too large to compile to regexes (even w/ nested schemas removed + some preprocessing), so the hybrid template/structured generation approach has been ideal for me thus far, hence the example. Cheers again 👍🏻

We would be interested in hearing how the JSON support performs with large schema.

riedgar-ms commented 3 months ago

I have been doing a little digging, to which end I've created the following test case:

from guidance import models, select, gen, system, assistant, user

def prepare_model(lm: models.Model):
    with system():
        lm += "You are a book information generator. Respond with \"author\" or \"title\" followed by the value."
    with user():
        lm += "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald"
    return lm

def test_with_gen(selected_model: models.Model):
    lm = prepare_model(selected_model)
    with assistant():
        lm += gen(max_tokens=100)
    print(lm)
    assert str(lm) == "Hello"

def test_with_select(selected_model: models.Model):
    lm = prepare_model(selected_model)
    with assistant():
        lm += select(["\"author\"", "\"title\""]) + gen(max_tokens=100)
    print(lm)
    assert str(lm) == "Hello"

The two tests obviously fail, but when running in the debugger I can break in `_transformers.py::get_logits` and take a look at the tokens passed in (for speed, I'm using GPT2).

They are identical, except that the select() variant has an extra `"` token appended. This is as it should be: both options passed to select() start with a double quote, so that token can be inserted automatically. But it means that the actual prompts sent into the model are different.

riedgar-ms commented 3 months ago

I have just persuaded Phi3 to work with this (see #885), and the final LLM state is:

gen()
<|user|>Convert the following information into JSON with keys 'author' and 'title'. Put the title first. The book is 'The Great Gatsby' by Fitzgerald.<|end|><|assistant|>{
  "title": "The Great Gatsby",
  "author": "Fitzgerald"
}<|end|>

and

select()
<|user|>Convert the following information into JSON with keys 'author' and 'title'. Put the title first. The book is 'The Great Gatsby' by Fitzgerald.<|end|><|assistant|>{
"title": "The Great Gatsby",
"author": "Fitzgerald"
}<|end|>

so, same JSON document, but different spacing.

Digging into the first call to get_logits(), I get:

gen():
Last tokens: [29915, 491, 22963, 914, 2741, 29889, 32007, 32001, 29912, 13] forced_bytes=b''

and

select()
Last tokens: [29915, 491, 22963, 914, 2741, 29889, 32007, 32001, 29912, 13] forced_bytes=b'"'

so, while the tokens being sent are the same, again select() is constraining the output so that it has to start with a double quote (as we should expect).
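One way to see why that forced quote matters, outside of guidance: compare the model's next-token preferences with and without the quote appended to the prompt. This is only a rough sketch in plain transformers -- it re-tokenizes the quote rather than forcing a token the way guidance does -- but it shows how conditioning on the opening `"` changes what the model prefers next. It assumes the `model`, `tokenizer`, and `prompt` from the original reproduction.

```python
# Rough sketch (plain transformers, not guidance internals): inspect the top
# next-token candidates with and without an opening quote after the prompt.
# Assumes `model`, `tokenizer`, and `prompt` from the reproduction above.
import torch

def top_next_tokens(text, k=5):
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    top = torch.topk(logits, k).indices.tolist()
    return [repr(tokenizer.decode([t])) for t in top]

print(top_next_tokens(prompt + "{\n  "))   # unconstrained continuation
print(top_next_tokens(prompt + '{\n  "'))  # continuation after an opening quote
```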

wjn0 commented 3 months ago

Ok, neat! I had assumed that guidance internally was doing something similar to outlines w.r.t. regex-based JSON generation, but after looking at the code more closely, it looks like the guidance implementation is "lazier" and therefore might actually work for my use case! Super exciting, will give it a shot.

I'll take a closer look at the tokenization issue with fresh eyes as well. Thanks for the pointers on that part of the library, I'll drill down there when I next get the chance.

riedgar-ms commented 3 months ago

If you have large JSON schemas, we'd be really interested to know how Guidance performs. We only have functionality tests; we've not really tried to stress our implementation.

Do let us know if there are gaps in the implementation too.

wjn0 commented 3 months ago

@riedgar-ms @hudson-ai So, I gave it a shot. wjn0/guidance@improve-json-schema-support contains a few hackish changes that I required for my schema. These are not legitimate fixes (i.e. you wouldn't want these as PRs:) so I've created feature request issues for each:

with some discussion items. Sadly, although this prevents errors, it still (understandably) hangs on a huge schema (hopefully not due to bugs I've introduced with my hackish fixes:'). Therefore, I've also created:

with some thoughts on strategy, along with an example. There are still some gaps compared to the templating approach I've been working on (mentioned above) that I've created as separate issues:

My intention here is not to drown the repo in feature requests or issues, but it's certainly moved beyond #876 alone and I think this is a reasonable breakdown. Obviously, please feel free to close/consolidate as appropriate. This would cover my use case (and the root cause of the ticket here). Cheers!

hudson-ai commented 3 months ago

@wjn0 please don't feel self-conscious about inundating us with issues -- I think all the issues you've opened represent really valid requests and start good discussions. We appreciate your engagement :)