Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping

Crista23 commented 1 month ago

My code is throwing the error below:

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/net/scratch/user/miniconda3/envs/vllm_guidance/lib/python3.9/site-packages/guidance/models/transformers/_transformers.py:150: UserWarning: Could not build_byte tokens from the tokenizer by encoding token strings: Round-trip encoding of tokens [!] failed! Got [128000, 0]
 warnings.warn(
Warning: lexer error: too many states: 10406 >= 10000; stopping

I can see this error is thrown in the code here https://github.com/guidance-ai/guidance/blob/main/guidance/models/transformers/_transformers.py#L233 and it looks like it's a tokenizer issue, however I am calling the guidance library without specifying a tokenizer

llm = models.Transformers(args.model_path, device_map="auto", trust_remote_code=True)

I am wondering how to fix this. Any advice appreciated, thanks!

Harsha-Nori commented 1 month ago

Hi @Crista23, sorry you're dealing with this! Which version of the package are you using? Are you on our release candidate / installing from source?

Even if a tokenizer isn't explicitly specified, we do need one for guidance to work properly. For transformers based models, we try to load it automatically from the model config. However, sometimes this can act up, especially if there are new tokens added to a model's vocabulary via fine tuning (and not updated in the config...).

Are you using a public/oss model? Do you mind sharing the link to it so that we can try to debug it on our side?

Crista23 commented 1 month ago

HI @Harsha-Nori , thanks a lot for your answer! I have installed guidance --pre using pip and the version installed is 0.2.0rc1. I am using this in combination with publicly available models such as LLAMA-8B-Instruct instantiated in the code using

llm = models.Transformers(args.model_path, device_map="auto", trust_remote_code=True)

outputs = llm + prompt_eval(item[key])

It has worked for a couple examples until it crashed with this error: "Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping". It looks like a tokenizer issue and even though I tried to replace "!" with the empty string in the input it still fails.

I would appreciate your thoughts on how to fix this, thank you!

Crista23 commented 1 month ago

@Harsha-Nori Any thoughts? Sorry to ask again, it's a pressing issue.

Harsha-Nori commented 1 month ago

Hi @Crista23, I can't seem to replicate this with a llama-8B model :(. Could you share some more details about your code, including the exact huggingface model and/or details of the prompt_eval method?

The error message can happen if the grammar you're constraining against is particularly complex, but I can't seem to replicate it on my side :(. Happy to also collaborate via email if you can't share publicly.

hudson-ai commented 1 month ago

@Crista23 if you can't share details of your prompt, would you be able to share the full traceback? Thanks!

jtbuter commented 1 month ago

@Harsha-Nori @hudson-ai I get a similar warning when initializing the llama 8b instruct model with guidance 0.1.16 and transformers 4.45.2

from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer
import guidance.models
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

llama3 = guidance.models.Transformers(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

The warning is the following UserWarning: Could not build_byte tokens from the tokenizer by encoding token strings: Round-trip encoding of tokens [!] failed! Got [128000, 0]

Can it be because the tokenizer encodes ! as [128000, 0], which decodes to <|begin_of_text|>!? Therefore this check fails?


if len(encoded_str) != 1:
    raise ValueError(f"Round-trip encoding of tokens [{token}] failed! Got {encoded_str}")```

hudson-ai commented 1 month ago

@jtbuter thanks for the repro -- I am able to reproduce the warning with transformers 4.45.2 (interestingly, not with my previously installed 4.44.0 version). We have a few methods for converting tokens into a form that we need in order to support constrained decoding, and the warning here is just saying that our preferred approach is failing and falling back to an alternative approach. Will definitely look into what's going on under the hood here -- thank you for the suggestion on where to look. I think you have the right idea.

Are you experiencing any downstream problems after seeing this warning?

This being said, the lexer error: too many states: 10406 >= 10000; stopping that @Crista23 is seeing "shouldn't" be caused by this -- it seems to be that our parser is finding the particular grammar they are constraining against to be disagreeable for whatever reason. I've seen something similar happen in some grammars where the parse tree is exceptionally ambiguous.

@Crista23 are you able to share any details about the constraints you are using? I would love to see us (1) improve robustness and (2) provide more helpful exceptions and warnings. A concrete example of what's causing this would really help to that end.

jtbuter commented 1 month ago

Thank you for the reply, I was not experiencing any other problems after this warning

guidance-ai / guidance

Round-trip encoding of tokens [!] failed, Warning: lexer error: too many states: 10406 >= 10000; stopping #1042