[Bug]: Bug in Guided Generation Logits Processor with `n>1`

maximzubkov commented 7 months ago

Your current environment

I used Docker:

git clone https://github.com/vllm-project/vllm.git
cd vllm
git pull origin pull/3211/head
docker build --target test -t vllm-grammars .
docker run --gpus=all -it vllm-grammars

On the server with 4x NVIDIA RTX A4000

🐛 Describe the bug

I tested the context-free grammar (CFG) support in vLLM by asking phi-1 to generate a simple SQL query, following this test from a recent PR:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.model_executor.guided_logits_processors import CFGLogitsProcessor

model = "microsoft/phi_1"
prompt = "Writa a simple SQL query to the table table_2 checking if col_1 equals to 1"

tokenizer = AutoTokenizer.from_pretrained(model)

simple_sql_grammar = """
start: select_statement

select_statement: "SELECT" column "from" table "where" condition

column: "col_1" | "col_2"
table: "table_1" | "table_2"
condition: column "=" number

number: "1" | "2"
"""
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    n=10,
    max_tokens=512,
    logits_processors=[CFGLogitsProcessor(simple_sql_grammar, tokenizer)]
)
llm = LLM(model=model, dtype="auto")
outputs = llm.generate([prompt], sampling_params)
print([
    output_.text for output_ in outputs[0].outputs
])

Although the CFGLogitsProcessor feature is not merged yet (the PR is still open), the above example worked fine with n=1 in SamplingParams. When I switched to n=10, however, the script failed with:

tokenizer_config.json: 100%|██████████| 237/237 [00:00<00:00, 746kB/s]
vocab.json: 100%|██████████| 798k/798k [00:00<00:00, 2.72MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 48.3MB/s]
tokenizer.json: 100%|██████████| 2.11M/2.11M [00:00<00:00, 13.2MB/s]
added_tokens.json: 100%|██████████| 1.08k/1.08k [00:00<00:00, 4.06MB/s]
special_tokens_map.json: 100%|██████████| 99.0/99.0 [00:00<00:00, 377kB/s]
config.json: 100%|██████████| 862/862 [00:00<00:00, 2.90MB/s]
INFO 03-16 17:24:54 llm_engine.py:68] Initializing an LLM engine (v0.3.3) with config: model='microsoft/phi_1', tokenizer='microsoft/phi_1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
INFO 03-16 17:24:56 attention.py:83] Using flash_attn backend.
INFO 03-16 17:24:57 weight_utils.py:167] Using model weights format ['*.bin']
pytorch_model.bin: 100%|██████████| 2.84G/2.84G [00:24<00:00, 114MB/s]
INFO 03-16 17:25:24 model_runner.py:96] Loading model weights took 2.6419 GB
INFO 03-16 17:25:25 gpu_executor.py:99] # GPU blocks: 3691, # CPU blocks: 1365
INFO 03-16 17:25:26 model_runner.py:691] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 03-16 17:25:26 model_runner.py:695] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 03-16 17:25:30 model_runner.py:763] Graph capturing finished in 4 secs.
Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/lark/lexer.py", line 673, in lex
    token = self.root_lexer.next_token(lexer_state, parser_state)
  File "/usr/local/lib/python3.10/dist-packages/lark/lexer.py", line 598, in next_token
    raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches 'S' in the current parser context, at line 1 col 1

SSSSSSSSSSELELELELELELELELELELECT
^
Expected one of: 
        * WHERE
        * TABLE_2
        * COL_2
        * "2"
        * TABLE_1
        * COL_1
        * EQUAL
        * SELECT
        * "1"
        * FROM

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/vllm-workspace/example.py", line 30, in <module>
    outputs = llm.generate([prompt], sampling_params)
  File "/vllm-workspace/vllm/entrypoints/llm.py", line 174, in generate
    return self._run_engine(use_tqdm)
  File "/vllm-workspace/vllm/entrypoints/llm.py", line 200, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/vllm-workspace/vllm/engine/llm_engine.py", line 621, in step
    output = self.model_executor.execute_model(
  File "/vllm-workspace/vllm/executor/gpu_executor.py", line 119, in execute_model
    output = self.driver_worker.execute_model(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/worker.py", line 222, in execute_model
    output = self.model_runner.execute_model(seq_group_metadata_list,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/vllm-workspace/vllm/worker/model_runner.py", line 598, in execute_model
    output = self.model.sample(
  File "/vllm-workspace/vllm/model_executor/models/phi.py", line 263, in sample
    next_tokens = self.sampler(head.weight, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vllm-workspace/vllm/model_executor/layers/sampler.py", line 83, in forward
    logits = _apply_logits_processors(logits, sampling_metadata)
  File "/vllm-workspace/vllm/model_executor/layers/sampler.py", line 165, in _apply_logits_processors
    logits_row = logits_processor(token_ids, logits_row)
  File "/vllm-workspace/vllm/model_executor/guided_logits_processors.py", line 92, in __call__
    allowed_tokens = self.fsm.allowed_token_ids(self.fsm_state[seq_id])
  File "/usr/local/lib/python3.10/dist-packages/outlines/fsm/fsm.py", line 313, in allowed_token_ids
    interactive.exhaust_lexer()
  File "/usr/local/lib/python3.10/dist-packages/lark/parsers/lalr_interactive_parser.py", line 52, in exhaust_lexer
    return list(self.iter_parse())
  File "/usr/local/lib/python3.10/dist-packages/lark/parsers/lalr_interactive_parser.py", line 43, in iter_parse
    for token in self.lexer_thread.lex(self.parser_state):
  File "/usr/local/lib/python3.10/dist-packages/lark/lexer.py", line 676, in lex
    raise e  # Raise the original UnexpectedCharacters. The root lexer raises it with the wrong expected set.
  File "/usr/local/lib/python3.10/dist-packages/lark/lexer.py", line 665, in lex
    yield lexer.next_token(lexer_state, parser_state)
  File "/usr/local/lib/python3.10/dist-packages/lark/lexer.py", line 598, in next_token
    raise UnexpectedCharacters(lex_state.text, line_ctr.char_pos, line_ctr.line, line_ctr.column,
lark.exceptions.UnexpectedCharacters: No terminal matches 'S' in the current parser context, at line 1 col 1

SSSSSSSSSSELELELELELELELELELELECT
^
Expected one of: 
        * SELECT

Processed prompts:   0%|          | 0/1 [00:10<?, ?it/s]

Digging deeper into the code, I found that this bug would also occur with RegexLogitsProcessor and JSONLogitsProcessor because of the current implementation of _apply_logits_processors. All three processors (RegexLogitsProcessor, JSONLogitsProcessor, and CFGLogitsProcessor) call self.fsm.allowed_token_ids for every sequence generated for the request (the n parallel samples), and perhaps because of this the cached state is stored incorrectly. As the error shows, the state is shared between the 10 sequences, and every token predicted by every sequence is appended to the same buffer: SSSSSSSSSSELELELELELELELELELELECT. So it might make sense to change _apply_logits_processors so that every sequence gets its own processor, e.g.:

logits_processors = sampling_params.logits_processors[logits_row_idx]
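To make the suspected failure mode concrete, here is a minimal sketch that does not use vLLM or outlines at all. `ToyGrammarProcessor` and `decode_with_shared_instance` are hypothetical names invented for illustration. The sketch only assumes that a guided processor keeps a running text buffer and that one instance is called once per sequence at every decoding step:

```python
class ToyGrammarProcessor:
    """Hypothetical stand-in for a guided processor that tracks generated text."""

    def __init__(self) -> None:
        # Text the processor believes has been generated so far.
        self.generation = ""

    def __call__(self, piece: str) -> str:
        # Each call appends the newly sampled piece to the buffer.
        self.generation += piece
        return self.generation


def decode_with_shared_instance(steps, n):
    """Call ONE processor instance for all n sequences at every step."""
    shared = ToyGrammarProcessor()
    for piece in steps:        # one decoding step
        for _ in range(n):     # every sequence in the group
            shared(piece)
    return shared.generation


shared_buffer = decode_with_shared_instance(["S", "EL"], n=10)
print(shared_buffer)
```

With n=10 the buffer collects ten copies of each piece instead of one, which resembles the interleaved string that the parser rejects in the traceback above.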

Looking forward to your response, and I would be happy to implement the changes under your guidance!

simon-mo commented 7 months ago

Good find! I just merged the PR. You are correct in diagnosing the issue. The right fix would be cloning the logits processor for every sequence in the group. Contribution welcomed indeed!
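A rough sketch of what "cloning the logits processor for every sequence in the group" could look like, using only the standard library. `ToyGrammarProcessor` and `clone_for_group` are hypothetical names, and the use of `copy.deepcopy` is an assumption made for illustration. The fix that actually lands in vLLM may clone differently, and deep-copying a real processor that holds a tokenizer and an FSM may be costly:

```python
import copy


class ToyGrammarProcessor:
    """Hypothetical stateful processor that tracks generated text."""

    def __init__(self) -> None:
        self.generation = ""

    def __call__(self, piece: str) -> str:
        self.generation += piece
        return self.generation


def clone_for_group(template, seq_ids):
    """Give every sequence id its own independent copy of the processor."""
    return {seq_id: copy.deepcopy(template) for seq_id in seq_ids}


template = ToyGrammarProcessor()
per_seq = clone_for_group(template, seq_ids=range(3))

# Every sequence advances only its own copy at each decoding step.
for piece in ["S", "EL", "ECT"]:
    for processor in per_seq.values():
        processor(piece)

buffers = [processor.generation for processor in per_seq.values()]
print(buffers)
```

Each copy ends up holding only its own sequence's text, and the template passed in by the user is left untouched, so it can be reused for the next request.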

maximzubkov commented 7 months ago

Hello, @simon-mo! Thank you for the prompt response. I implemented a fix and left comments in the code explaining why I made certain design decisions; see the following PR.

maximzubkov commented 7 months ago

I also slightly updated the reproduction script to cover the case where several requests are sent to the engine at once via [prompt, prompt]. This also failed with the same bug when I tested it.

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.model_executor.guided_logits_processors import CFGLogitsProcessor

model = "microsoft/phi_1"
prompt = "Writa a simple SQL query to the table table_2 checking if col_1 equals to 1"

tokenizer = AutoTokenizer.from_pretrained(model)

simple_sql_grammar = """
start: select_statement

select_statement: "SELECT" column "from" table "where" condition

column: "col_1" | "col_2"
table: "table_1" | "table_2"
condition: column "=" number

number: "1" | "2"
"""
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    n=10,
    max_tokens=512,
    logits_processors=[CFGLogitsProcessor(simple_sql_grammar, tokenizer)]
)
llm = LLM(model=model, dtype="auto")
outputs = llm.generate([prompt, prompt], sampling_params)
print([
    output_.text for output_ in outputs[0].outputs
])

vllm-project / vllm

[Bug]: Bug in Guided Generation Logits Processor with `n>1` #3448
