noamgat / lm-format-enforcer

Enforce the output format (JSON Schema, Regex etc) of a language model
MIT License

Leading Comma in JSON Array #99

Closed NJordan72 closed 4 months ago

NJordan72 commented 4 months ago

This one might be user error, but I was trying to understand how the JsonSchemaParser works and wrote a quick-and-dirty loop to generate random JSON from a parser.

Some of the time it works fine, but other times I get invalid JSON. The most notable and reproducible of these errors is an array of objects generated with a leading comma: [, {object},...]. It appears that in some cases a comma is an allowed character immediately after the array has been opened.

There is some randomness involved in the code below, but if I run it 4-5 times I can usually reproduce it.

from typing import List
from pydantic import BaseModel, Field

from lmformatenforcer import CharacterLevelParserConfig, JsonSchemaParser

import random
import json

class TreeNode(BaseModel):
    name: str = Field(max_length=4)
    children: List["TreeNode"] = Field(max_items=2)

class Result(BaseModel):
    tree: TreeNode

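# Build a character-level parser from the Pydantic-generated JSON Schema.
parser = JsonSchemaParser(Result.model_json_schema(), config=CharacterLevelParserConfig(max_consecutive_whitespaces=1))

result = ""

# Random walk: at each step, pick any character the parser currently allows.
while True:
    allowable = parser.get_allowed_characters()

    # Stop once the parser allows no further characters.
    if not allowable:
        break

    choice = random.choice(allowable)
    parser = parser.add_character(choice)
    result += choice

# Everything the parser allowed should parse as JSON; report it if not.
try:
    json.loads(result)
    print(result)
except json.JSONDecodeError:
    print(f"Invalid JSON: {result}")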
noamgat commented 4 months ago

Can you make sure that you are using the latest LM Format Enforcer? We fixed a few bugs like this in the past few versions. If you are, can you attach an example of JSON that the library allows but that is invalid for the schema you show?

NJordan72 commented 4 months ago

I am using 0.10.1.

Here is the invalid output...

{"tree":{"name":"sJGl","children":[{"name":"czJ.","children":[{"name":"/sb!","children":[]},{"children":[],"name":"e*2g"}]},{"name":"u0Yp","children":[,{"children":[{"children":[,{"children":[{"name":"9q,+","children":[]}],"name":"83,M"}],"name":"J,gG"}],"name":"@B.}"}]}]}}

Notably there are two leading commas.
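
For anyone who wants to pinpoint such commas in a long generated string, a small scanner along these lines might help. This is only a sketch using the standard library: `find_leading_commas` is a hypothetical helper name, not part of lm-format-enforcer, and it only looks for a comma that is the first non-whitespace character after an opening `[`. It tracks string literals so that commas and brackets inside names such as "9q,+" are ignored.

```python
from typing import List


def find_leading_commas(text: str) -> List[int]:
    """Return the indices of commas that directly follow an opening '['.

    Hypothetical helper for debugging; characters inside JSON string
    literals are skipped, and whitespace between '[' and ',' is ignored.
    """
    positions: List[int] = []
    in_string = False           # currently inside a "..." literal
    escaped = False             # previous character in the string was a backslash
    after_open_bracket = False  # last significant character was '['

    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue

        if ch in " \t\r\n":
            continue  # whitespace does not change the state

        if ch == "," and after_open_bracket:
            positions.append(i)

        after_open_bracket = ch == "["
        if ch == '"':
            in_string = True

    return positions
```

Calling `find_leading_commas(result)` on a generated string returns an empty list when no array starts with a comma; each returned index can be used to slice out the surrounding context.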

FWIW this is the Pydantic JSON Schema:

{'$defs': {'TreeNode': {'properties': {'name': {'maxLength': 4, 'title': 'Name', 'type': 'string'}, 'children': {'items': {'$ref': '#/$defs/TreeNode'}, 'maxItems': 2, 'title': 'Children', 'type': 'array'}}, 'required': ['name', 'children'], 'title': 'TreeNode', 'type': 'object'}}, 'properties': {'tree': {'$ref': '#/$defs/TreeNode'}}, 'required': ['tree'], 'title': 'Result', 'type': 'object'}
NJordan72 commented 4 months ago

As I iterate through, you can see that right after the [ is opened, the characters ], { and , are all allowed, and the comma seems wrong.

Allowable: c
Allowable: h
Allowable: i
Allowable: l
Allowable: d
Allowable: r
Allowable: e
Allowable: n
Allowable: "
Allowable: :
Allowable: [
Allowable: ]{,
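
A trace like the one above can be collected with a small wrapper around the random-walk loop. The sketch below assumes only the two methods already used in this thread, `get_allowed_characters()` and `add_character()`. The names `trace_random_walk` and `FixedStringParser` are hypothetical: the latter is a stand-in stub so the example runs without the library installed, and it is not how JsonSchemaParser works internally.

```python
import random
from typing import List, Tuple


class FixedStringParser:
    """Hypothetical stub: allows exactly the next character of a fixed string."""

    def __init__(self, remaining: str) -> None:
        self.remaining = remaining

    def get_allowed_characters(self) -> str:
        return self.remaining[:1]

    def add_character(self, ch: str) -> "FixedStringParser":
        return FixedStringParser(self.remaining[1:])


def trace_random_walk(parser, rng: random.Random,
                      max_steps: int = 10_000) -> Tuple[str, List[str]]:
    """Randomly walk a character-level parser, recording allowed characters.

    Returns the generated string and, for each step, the characters the
    parser allowed at that step. max_steps guards against unbounded output.
    """
    result = ""
    trace: List[str] = []
    for _ in range(max_steps):
        allowable = parser.get_allowed_characters()
        if not allowable:
            break  # the parser allows nothing further
        trace.append(allowable)
        choice = rng.choice(allowable)
        parser = parser.add_character(choice)
        result += choice
    return result, trace


if __name__ == "__main__":
    text, steps = trace_random_walk(FixedStringParser("[]"), random.Random(0))
    for allowed in steps:
        print(f"Allowable: {allowed}")
```

With the real library, the JsonSchemaParser instance from the original snippet could be passed in place of the stub, and a seeded `random.Random` makes a failing run repeatable.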
noamgat commented 4 months ago

Should be fixed in v0.10.2, please reopen if the issue persists.