Open wjn0 opened 4 months ago
That's interesting. Is this happening with other models too?
I'm really only familiar with the Llama family of models, but Phi 2 does not seem to display the same behaviour, at least with this example.
However, it does do something else which is weird: it generates extraneous whitespace when using `gen` after `select` (I would expect the generations to match token-for-token, given that greedy decoding in `select` produces the same first few tokens as `gen`):
Code
Output with `select` used for the first property name + `gen` for the rest:
````
Instruct: Generate a JSON structure representing a book with the following properties: title, author, and publication date. The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925. Generate the author property first.
Output: ```json
{
"title": "The Great Gatsby",
"author": "F. Scott Fitzgerald",
"publication_date": "1925"
}
````
Output with `gen` for the whole thing:
````
Instruct: Generate a JSON structure representing a book with the following properties: title, author, and publication date. The book is called 'The Great Gatsby', written by F. Scott Fitzgerald, and was published in 1925. Generate the author property first.
Output: ```json
{
"title": "The Great Gatsby",
"author": "F. Scott Fitzgerald",
"publication_date": "1925"
}
````
If there's a model family you'd like me to try, feel free to let me know and I'll poke at it some more.
The `select` vs `gen` discrepancy happens with:

- unsloth/llama-3-70b-Instruct-bnb-4bit
- meta-llama/meta-llama-3-8B-Instruct

(so it's not internal `unsloth` modifications causing issues, at least)

The whitespace issue happens with:

- microsoft/phi-2
I've also tried `use_fast=False` on the tokenizer, just in case, and that doesn't seem to do anything.
And here are my experiments that suggest that this relates to the need to heal tokens around the "boundaries" between prompting and generation:
```python
>>> tokenizer.encode(prompt + "{\n ")
[128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2363, 2038, 14143, 13, 20400, 264, 4823, 6070, 14393, 264, 2363, 449, 279, 2768, 6012, 25, 2316, 11, 3229, 11, 323, 17009, 2457, 13, 20400, 279, 3229, 3424, 1176, 13, 128009, 128006, 882, 128007, 271, 791, 2363, 374, 2663, 364, 791, 8681, 480, 36614, 518, 5439, 555, 435, 13, 10016, 62314, 11, 323, 574, 4756, 304, 220, 5926, 20, 13, 128009, 128006, 78191, 128007, 271, 517, 256]
>>> tokenizer.encode(prompt + "{\n \"author\"")
[128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2363, 2038, 14143, 13, 20400, 264, 4823, 6070, 14393, 264, 2363, 449, 279, 2768, 6012, 25, 2316, 11, 3229, 11, 323, 17009, 2457, 13, 20400, 279, 3229, 3424, 1176, 13, 128009, 128006, 882, 128007, 271, 791, 2363, 374, 2663, 364, 791, 8681, 480, 36614, 518, 5439, 555, 435, 13, 10016, 62314, 11, 323, 574, 4756, 304, 220, 5926, 20, 13, 128009, 128006, 78191, 128007, 271, 517, 220, 330, 3170, 1]
>>> tokenizer.encode(prompt + "{\n \"title\"")
[128000, 128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2363, 2038, 14143, 13, 20400, 264, 4823, 6070, 14393, 264, 2363, 449, 279, 2768, 6012, 25, 2316, 11, 3229, 11, 323, 17009, 2457, 13, 20400, 279, 3229, 3424, 1176, 13, 128009, 128006, 882, 128007, 271, 791, 2363, 374, 2663, 364, 791, 8681, 480, 36614, 518, 5439, 555, 435, 13, 10016, 62314, 11, 323, 574, 4756, 304, 220, 5926, 20, 13, 128009, 128006, 78191, 128007, 271, 517, 220, 330, 2150, 1]
```
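The boundary effect visible in those encodings can be illustrated with a deliberately tiny toy tokenizer (this is not the real Llama vocabulary; the vocab and the greedy longest-match rule below are stand-ins for BPE merges): encoding the whole string in one pass can merge the space and the opening quote into a single token, while encoding the prompt and the selected text separately can never merge across the seam. Repairing that seam is exactly what token healing is for.

```python
# Toy illustration only: a greedy longest-match tokenizer over a tiny,
# made-up vocabulary, standing in for BPE merges in a real tokenizer.
VOCAB = ["{\n", " ", " \"", "\"", "author", "title", "\":"]

def encode(text):
    """Tokenize by repeatedly taking the longest matching vocab piece."""
    tokens = []
    i = 0
    while i < len(text):
        for piece in sorted(VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            raise ValueError(f"untokenizable at position {i}")
    return tokens

# Encoding the whole string merges the space and quote into one ' "' token...
whole = encode("{\n \"author\"")
# ...but encoding the prompt and the selected option separately cannot merge
# across the boundary, so the model sees a different token sequence.
split = encode("{\n ") + encode("\"author\"")

print(whole)  # ['{\n', ' "', 'author', '"']
print(split)  # ['{\n', ' ', '"', 'author', '"']
```

The two sequences decode to the same text, but a model conditioned on one will not in general continue the same way as a model conditioned on the other, which is consistent with the discrepancies above.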
`LlamaCpp` produces the same results as `transformers`.
Thanks for the extra information. This is odd. Given that the only difference is changing from `gen()` to `select()`, I would expect any token healing to be working the same way.
Yes, I think you're right! Here's an example where it still happens (this time, a discrepancy between two `gen` calls with overlapping prompts), even without token healing being a factor (see the tokenization two replies up). When a leading quote is not provided, a property named `author` is generated; when it is, a property named `_author` is generated.
A quick note on severity/impact here -- I'm seeing this bug more often than not in my current project. Various workarounds (which amount to a kind of manual token healing, as in the above example) are a bit finicky, so for my use case this precludes me from using `guidance`. I'm transitioning to another todo item of mine for the next little while, but when I return I'll likely either (a) attempt bug-hunting in `guidance`, or (b) roll my own. In either case, I might learn something I can report back.
I know you guys are super busy, but if you or the rest of the guidance team have any tips for how I might approach (a) -- mainly where to start in grokking the internals of the library -- they'd be appreciated :) rolling my own would feel a bit silly. I'd ideally like to step through wherever guidance is doing tokenization with a debugger if you can point me in that direction.
Cheers and thanks again!
If you're really after JSON generation, then may I suggest our recently released JSON support: https://guidance.readthedocs.io/en/latest/generated/guidance.json.html#guidance.json
Thank you for all the analysis you have done so far!
@wjn0 I want to take a closer look at what's going on behind the scenes RE: token healing, but in the meantime I second @riedgar-ms's suggestion to look at the built-in JSON support. Would you let us know if it behaves as you expect in this situation? Also +1 to the thank you!
Thanks folks -- yes, absolutely, I love the JSON option when it's available to me. I haven't checked it for this minimal reproducible example (because it's just that for me -- a reduction of some stuff I was seeing in the real world), but I suspect it's OK based on earlier experimentation.
@hudson-ai @riedgar-ms In practice, I'm working with massive JSON schemas (several megabytes plain text w/ nested schemas) that seem to be far too large to compile to regexes (even w/ nested schemas removed + some preprocessing), so the hybrid template/structured generation approach has been ideal for me thus far, hence the example. Cheers again 👍🏻
We would be interested in hearing how the JSON support performs with large schemas.
I have been doing a little digging, to which end I've created the following test case:
```python
from guidance import models, select, gen, system, assistant, user

def prepare_model(lm: models.Model):
    with system():
        lm += "You are a book information generator. Respond with \"author\" or \"title\" followed by the value."
    with user():
        lm += "The book is called 'The Great Gatsby', written by F. Scott Fitzgerald"
    return lm

def test_with_gen(selected_model: models.Model):
    lm = prepare_model(selected_model)
    with assistant():
        lm += gen(max_tokens=100)
    print(lm)
    assert str(lm) == "Hello"

def test_with_select(selected_model: models.Model):
    lm = prepare_model(selected_model)
    with assistant():
        lm += select(["\"author\"", "\"title\""]) + gen(max_tokens=100)
    print(lm)
    assert str(lm) == "Hello"
```
The two tests obviously fail, but when running in the debugger I can break in `_transformers.py::get_logits` and take a look at the tokens passed in (for speed, I'm using GPT2). They are identical, except that the `select()` variant has an extra `"` token appended. This is as it should be -- both options to `select()` start with a double quote, so that can be inserted automatically. But that means that the actual prompts sent into the model are different.
Have just persuaded Phi3 to work with this (see #885), and the final LLM state is:

`gen()`:

```
<|user|>Convert the following information into JSON with keys 'author' and 'title'. Put the title first. The book is 'The Great Gatsby' by Fitzgerald.<|end|><|assistant|>{
"title": "The Great Gatsby",
"author": "Fitzgerald"
}<|end|>
```

and `select()`:

```
<|user|>Convert the following information into JSON with keys 'author' and 'title'. Put the title first. The book is 'The Great Gatsby' by Fitzgerald.<|end|><|assistant|>{
"title": "The Great Gatsby",
"author": "Fitzgerald"
}<|end|>
```

so: same JSON document, but different spacing.
Digging into the first call to `get_logits()`, I get:

`gen()`:

```
Last tokens: [29915, 491, 22963, 914, 2741, 29889, 32007, 32001, 29912, 13] forced_bytes=b''
```

and `select()`:

```
Last tokens: [29915, 491, 22963, 914, 2741, 29889, 32007, 32001, 29912, 13] forced_bytes=b'"'
```

so, while the tokens being sent are the same, again `select()` is constraining the output so that it has to start with a double quote (as we should expect).
Ok, neat! I had assumed that guidance internally was doing something similar to outlines w.r.t. regex-based JSON generation, but after looking at the code more closely it looks like the guidance implementation is "lazier" and therefore might actually work for my use case! Super exciting, will give it a shot.
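The lazy-vs-eager distinction can be sketched in a few lines (a toy under my own assumptions, not code from either library): a schema with a recursive `$ref` has no finite eager expansion into a flat pattern, but a walker that resolves references only when it actually visits them terminates naturally. The `depth` cap here stands in for the model deciding when to stop recursing.

```python
# Toy recursive JSON schema: a node whose "child" refers back to itself.
SCHEMA = {
    "$defs": {
        "node": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "child": {"$ref": "#/$defs/node"},  # self-reference
            },
        }
    },
    "$ref": "#/$defs/node",
}

def resolve(schema, root):
    """Follow $ref pointers lazily, one level at a time."""
    while "$ref" in schema:
        path = schema["$ref"].lstrip("#/").split("/")
        schema = root
        for part in path:
            schema = schema[part]
    return schema

def sample(schema, root, depth):
    """Generate a skeleton instance, recursing only as deep as needed."""
    schema = resolve(schema, root)
    if schema.get("type") == "string":
        return ""
    out = {}
    for key, sub in schema.get("properties", {}).items():
        if "$ref" in sub and depth == 0:
            continue  # stop recursing instead of expanding the ref forever
        out[key] = sample(sub, root, depth - 1)
    return out

print(sample(SCHEMA, SCHEMA, depth=2))
# → {'name': '', 'child': {'name': '', 'child': {'name': ''}}}
```

An eager regex compiler has to unroll `#/$defs/node` before generation starts, which is why huge or recursive schemas blow up there, while a lazy walker only pays for the branches the generation actually takes.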
I'll take a closer look at the tokenization issue with fresh eyes as well. Thanks for the pointers on that part of the library, I'll drill down there when I next get the chance.
If you have large JSON schema, we'd be really interested to know how Guidance performs. We only have functionality tests; we've not really tried to stress our implementation.
Do let us know if there are gaps in the implementation too.
@riedgar-ms @hudson-ai So, I gave it a shot. wjn0/guidance@improve-json-schema-support contains a few hackish changes that I required for my schema. These are not legitimate fixes (i.e. you wouldn't want these as PRs :) so I've created feature request issues for each:
with some discussion items. Sadly, although this prevents errors, it still (understandably) hangs on a huge schema (hopefully not due to bugs I've introduced with my hackish fixes :'). Therefore, I've also created:
with some thoughts on strategy, along with an example. There are still some gaps compared to the templating approach I've been working on (mentioned above ^) that I've created as separate issues:
My intention here is not to drown the repo in feature requests or issues, but it's certainly moved beyond #876 alone, and I think this is a reasonable breakdown. Obviously, please feel free to close/consolidate as appropriate. This would cover my use case (and the root cause of the ticket here). Cheers!
@wjn0 please don't feel self-conscious about inundating us with issues -- I think all the issues you've opened represent really valid requests and start good discussions. We appreciate your engagement :)
**The bug**
I have a minimal reproducible example where I would expect `select` and `gen` to produce similar results, but they don't. My experimentation suggests maybe a tokenization or token healing issue, but I'm not sure. If the behaviour is expected, it would be useful to have some documentation to better understand why.

**To Reproduce**
With `select`, the next output is `"title"` ("wrong" in a certain sense), while with unconstrained generation the output is `"author"` ("correct" in a certain sense).

(1) `gen` output:

(2) `select` output:

**System info (please complete the following information):**
- Guidance version (`guidance.__version__`): 0.1.15