guidance-ai / guidance

A guidance language for controlling large language models.

JSON and TypeAdapters produce unwanted values or empty list #1069

Open AlbanPerli opened 3 weeks ago

AlbanPerli commented 3 weeks ago

Hi @hudson-ai!

Concerning TypeAdapter-constrained generation, here are some examples of the issue mentioned here:

from guidance import models, capture
from guidance import json as jj
from pydantic import BaseModel, TypeAdapter
import json
from Noema.cfg import *

lm = models.LlamaCpp(
    "../Models/Mistral-NeMo-Minitron-8B-Instruct.Q4_K_M.gguf",
    n_gpu_layers=99,
    n_ctx=512*8,
    echo=False
)

lm.reset()
lm += "Generate a list of 3 integers between 1 and 4: " + capture(G.arrayOf(G.num()), name="generated_object")
print(lm["generated_object"])
# Output: ["1", "2", "3"]

lm.reset()
schema = TypeAdapter(list[int])
lm += "Generate a list of 3 integers between 0 and 4: " + jj(name="generated_object", schema=schema)
print(json.loads(lm["generated_object"]))
# Output: []

lm.reset()
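# The French prompt reads: "Create a list of the different steps described here: This morning I left early, then I bought apples, and finally I went to the restaurant."
lm += "Créé une liste des différentes étapes décrites ici: Ce matin je suis parti tot, puis j'ai acheté des pommes et enfin je suis allé au restaurant." + capture(G.arrayOf(G.sentence()), name="generated_object")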
print(lm["generated_object"])
# Output: ["Ce matin je suis parti tot, puis j'ai acheté des pommes et enfin je suis allé au restaurant."]

lm.reset()
schema = TypeAdapter(list[str])
lm += "Créé une liste des différentes étapes décrites ici: Ce matin je suis parti tot, puis j'ai acheté des pommes et enfin je suis allé au restaurant." + jj(name="generated_object", schema=schema)
print(json.loads(lm["generated_object"]))
# Output: []

The file containing the custom CFG is here.

This is just a workaround, but it helps to produce a non-empty list.

Concerning the JSON:

lm.reset()
class Schema(BaseModel):
    weather: str
lm += "What is the weather today? " + jj(name="generated_object", schema=Schema)
print(json.loads(lm["generated_object"]))
# Output using Minitron 8B: {'weather': ', '}
# Output using llama3 instruct: {'weather': ':sunny:'}

I'm not sure I understand what the expected generation is, but it seems that characters from the format are interfering with the generated content.

hudson-ai commented 2 weeks ago

Hi @AlbanPerli, sorry for the late reply here :)

I think that part of what you are encountering here is that lists aren't forced to be non-empty by default (I think your custom grammar definitions enforce a minimum length of one). If you want to enforce this behavior with TypeAdapters, you can use typing.Annotated and annotated_types.MinLen like so:

from typing import Annotated
from annotated_types import MinLen
from pydantic import TypeAdapter

ta = TypeAdapter(Annotated[list[int], MinLen(1)])
ta.json_schema()
# Output: {'items': {'type': 'integer'}, 'minItems': 1, 'type': 'array'}
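For context on why `[]` came back in the examples above: without `minItems`, the empty array is a valid instance of the schema, so the model is free to close the list right away. Here is a minimal sketch of that rule. The helper `array_length_ok` is hypothetical (it is not part of guidance or pydantic) and it only looks at the `minItems` keyword:

```python
# Hypothetical helper for illustration only: it checks just the `minItems`
# keyword of a JSON schema and ignores every other constraint.
def array_length_ok(schema: dict, value: list) -> bool:
    # In JSON Schema, omitting `minItems` behaves like `minItems: 0`.
    return len(value) >= schema.get("minItems", 0)


# Schema equivalent to TypeAdapter(list[int]): no lower bound on length.
unconstrained = {"type": "array", "items": {"type": "integer"}}
# Schema produced by the MinLen(1) annotation shown above.
at_least_one = {"type": "array", "items": {"type": "integer"}, "minItems": 1}

print(array_length_ok(unconstrained, []))  # True
print(array_length_ok(at_least_one, []))   # False
print(array_length_ok(at_least_one, [1]))  # True
```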

You can of course get this behavior by just writing the JSON schema directly, or if you're using a pydantic.BaseModel, you can do these annotations a bit more ergonomically with the pydantic.Field descriptor.
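A rough sketch of those two alternatives, assuming pydantic v2. The model name `Steps` and its field are made up for illustration and do not come from the original report:

```python
from pydantic import BaseModel, Field, ValidationError

# Option 1: write the JSON schema by hand.
raw_schema = {"type": "array", "items": {"type": "integer"}, "minItems": 1}


# Option 2: constrain a list field on a BaseModel (hypothetical model).
class Steps(BaseModel):
    steps: list[str] = Field(min_length=1)


# min_length on a list field shows up as `minItems` in the JSON schema.
field_schema = Steps.model_json_schema()["properties"]["steps"]
print(field_schema["minItems"])  # 1

# The same constraint is enforced when validating Python data.
try:
    Steps(steps=[])
    empty_accepted = True
except ValidationError:
    empty_accepted = False
print(empty_accepted)  # False
```

Either `raw_schema` or `Steps` should then be usable as the `schema=` argument of `guidance.json`, in the same way `Schema` is passed in the original report.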

If this doesn't address the core issue you're seeing, just let me know and we can figure it out :)

AlbanPerli commented 3 days ago

Hi @hudson-ai, my turn to apologize for the response time! :)

The point was indeed the minimum length; I wasn't aware of this parameter for the TypeAdapter.

Thank you!

hudson-ai commented 1 day ago

Ok, good to know that works for you! Let us know if you hit any other unexpected or unintuitive behaviors :)