noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License
1.42k stars 65 forks

When using json schema in vLLM/Aphrodite-engine, lmfe generates a lot of `":"` as json properties #94

Closed sgsdxzy closed 5 months ago

sgsdxzy commented 5 months ago

The JSON template sent as "guided_json" in the request to the OpenAI-compatible API of vLLM/Aphrodite:

json_template = {
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "tool_name": {"type": "string"},
            "parameters": {
                "type": "object",
                "additionalProperties": {
                    "anyOf": [{"type": "string"}, {"type": "number"}, {"type": "boolean"}]
                },
            },
        },
        "required": ["tool_name", "parameters"],
        "additionalProperties": {},
    },
    "minItems": 1,
    "maxItems": 1,
}
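As a stdlib-only sanity check, the shape this template demands can be approximated with a small hypothetical helper (not a full JSON Schema validator; `matches_template` is illustrative, not part of lmfe or vLLM):

```python
import json

def matches_template(doc):
    """Loose check of the template above: an array of exactly one object
    with a string "tool_name" and a "parameters" object whose values are
    strings, numbers, or booleans (property names are unconstrained)."""
    if not (isinstance(doc, list) and len(doc) == 1 and isinstance(doc[0], dict)):
        return False
    item = doc[0]
    if not isinstance(item.get("tool_name"), str):
        return False
    params = item.get("parameters")
    if not isinstance(params, dict):
        return False
    return all(isinstance(v, (str, int, float, bool)) for v in params.values())

doc = json.loads('[{"tool_name": "internet_search", "parameters": {"query": "penguins"}}]')
print(matches_template(doc))  # True
```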

Model output without any constraint

[
    {
        "tool_name": "internet_search",
        "parameters": {
            "query": "biggest penguin species",
            "provider": "Google"
        }
    }
]

Model output with "guided_json" and "guided_decoding_backend": "lm-format-enforcer"

[
    {
        "tool_name": "internet_search",
        "parameters": {
           ":": "biggest penguin species world"
        }
    }
]

Model output with "guided_json" and "guided_decoding_backend": "outlines"

[
    {
        "tool_name": "internet_search",
        "parameters": {
            "query": "biggest penguin species",
            "provider": "Google"
        }
    }
]

The model: CohereForAI/c4ai-command-r-plus. The prompt: https://pastebin.com/rtx2PcVv (adapted from the function calling example of c4ai-command-r-plus).

noamgat commented 5 months ago

How are you running this? Through the latest vLLM image? The latest vLLM release does not yet contain a few JSON Schema bug fixes that were made in the past weeks (the PR was approved, but the fixes are not in 0.4.1). Can you make sure you are on the latest version?

sgsdxzy commented 5 months ago

Yeah, I was using the 0.4.1 release; I will try the latest, thanks. Could you point me to the related vLLM PR?

sgsdxzy commented 5 months ago

I am using vllm main with lmfe==0.9.8, and during the same request I encountered:

ERROR:root:Unknown LMFormatEnforcer Problem. Prefix: '[
    {
        "tool_name": "internet_search",
        "parameters": {
           "hquery": "biggest penguin in the world",
           "hprovider": "Google"
        }
'
Terminating the parser. Please open an issue at
https://github.com/noamgat/lm-format-enforcer/issues with the prefix and CharacterLevelParser parameters
Traceback (most recent call last):
  File "/home/sgsdxzy/micromamba/envs/vllm/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 96, in _compute_allowed_tokens
    self._collect_allowed_tokens(state.parser, self.tokenizer_tree.root, allowed_tokens, shortcut_key)
  File "/home/sgsdxzy/micromamba/envs/vllm/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 144, in _collect_allowed_tokens
    self._collect_allowed_tokens(next_parser, next_tree_node, allowed_tokens, None)
  File "/home/sgsdxzy/micromamba/envs/vllm/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 144, in _collect_allowed_tokens
    self._collect_allowed_tokens(next_parser, next_tree_node, allowed_tokens, None)
  File "/home/sgsdxzy/micromamba/envs/vllm/lib/python3.11/site-packages/lmformatenforcer/tokenenforcer.py", line 142, in _collect_allowed_tokens
    next_parser = parser.add_character(character)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/sgsdxzy/micromamba/envs/vllm/lib/python3.11/site-packages/lmformatenforcer/jsonschemaparser.py", line 63, in add_character
    while new_character not in self.object_stack[receiving_idx].get_allowed_characters():
                               ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
IndexError: list index out of range

Any idea why?

noamgat commented 5 months ago

Interesting. This looks closer to a good completion, so the version upgrade fixed something, but this seems to be a new bug. I hope to get to it in the coming days. It would be best if you could create a unit test that reproduces the problem; if not, I will try based on the information you provided here. My guess is that it's related to the fact that the top-level object is an array, which should be supported but is less battle-tested than a top-level dict.

sgsdxzy commented 5 months ago

Yes, changing the outermost array to an object works fine.

sgsdxzy commented 5 months ago

I might just have gotten a lucky draw. With a fixed random seed I am still getting

[
    {
        "tool_name": "internet_search",
        "parameters": {
           ":": "biggest penguin species world"
        }
    }
]

as the response with lmfe 0.9.8, alongside the `Unknown LMFormatEnforcer Problem` error above.

noamgat commented 5 months ago

Thanks! I hope to look at this in the coming days.

noamgat commented 5 months ago

I just released 0.9.10 with a fix that should remove the `Unknown LMFormatEnforcer Problem` issue. However, I'm not sure it will solve your problem, as the last response you posted conforms to the json schema you posted (obviously it's not a good one, but that's not LMFE's job).
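For illustration (stdlib only): in the schema above, `additionalProperties` constrains the *values* inside `parameters`, not the property names, so a key like `":"` is schema-valid:

```python
import json

resp = json.loads(
    '[{"tool_name": "internet_search",'
    ' "parameters": {":": "biggest penguin species world"}}]'
)
params = resp[0]["parameters"]
# "additionalProperties" restricts only the values; any key, even ":", passes.
print(all(isinstance(v, (str, int, float, bool)) for v in params.values()))  # True
```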

sgsdxzy commented 5 months ago

I can confirm 0.9.10 fixed the ERROR:root:Unknown LMFormatEnforcer Problem. However, I think lmfe gives the LLM wrong logits that prevent some valid responses from being generated. It makes conforming to the schema necessary but not sufficient: it does enforce the json schema so all generations are valid, but not all valid generations are allowed.

[
    {
        "tool_name": "internet_search",
        "parameters": {
            "query": "biggest penguin species",
            "provider": "Google"
        }
    }
]

also conforms to the schema and should be a much more likely generation (the LLM produces it without any constraint, and so does outlines), but lmfe somehow prevents it.

noamgat commented 5 months ago

I now see the problem: this language model prefers spaces over tabs and uses an indentation width of 4. This causes the completion to contain 13 consecutive whitespace characters (a newline plus 12 spaces). LMFE has a heuristic constant MAX_CONSECUTIVE_WHITESPACES=12 in order to avoid infinite whitespace loops (which are legal JSON, but probably unwanted). If you are able to test this, can you try increasing this constant and trying again?

One of the upcoming features I plan to add is allowing environment variables to modify some of these configurations, making it easier to change LMFE's heuristics in non-code environments (such as the vLLM OpenAI server).
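The indentation arithmetic above can be verified with a short stdlib snippet: at 4-space indentation, the innermost keys sit at nesting depth 3, so each of those lines is preceded by a newline plus 12 spaces, i.e. 13 consecutive whitespace characters.

```python
import json
import re

# The completion the model prefers: 4-space indentation, nesting depth 3.
output = [{"tool_name": "internet_search",
           "parameters": {"query": "biggest penguin species",
                          "provider": "Google"}}]

text = json.dumps(output, indent=4)
# Longest run of consecutive whitespace characters (newline + indentation).
longest = max(len(run) for run in re.findall(r"\s+", text))
print(longest)  # 13, one more than the MAX_CONSECUTIVE_WHITESPACES=12 heuristic
```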

sgsdxzy commented 5 months ago

I can confirm that setting MAX_CONSECUTIVE_WHITESPACES in consts.py to a larger value completely fixes this. Yes, making it configurable through environment variables would be a better solution.

noamgat commented 5 months ago

https://github.com/noamgat/lm-format-enforcer/pull/97 - Coming very soon :)

noamgat commented 5 months ago

Released in v0.10.1. Can you check if you can now solve the problem via Configuration Options?
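If the new Configuration Options work as described, the override could look something like the following sketch. The variable name `LMFE_MAX_CONSECUTIVE_WHITESPACES` is an assumption here and should be checked against the v0.10.1 README before use:

```shell
# Assumed env-var name; verify against LMFE's Configuration Options docs.
export LMFE_MAX_CONSECUTIVE_WHITESPACES=24
# Then start the vLLM OpenAI-compatible server as usual, e.g.:
python -m vllm.entrypoints.openai.api_server --model CohereForAI/c4ai-command-r-plus
```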

sgsdxzy commented 5 months ago

v0.10.1 fixes this issue.