Support for case sensitive Literal/Enum and maxLength, minLength for strings

h4gen commented 1 week ago

First of all, many thanks to @hudson-ai for implementing the awesome Pydantic functionality. I really love what it is doing, and it works really well so far. While working with it, we noticed some things that I would like to suggest as features.

Is your feature request related to a problem? Please describe.

Literal and Enums: Literals or Enums that are translated from Pydantic to a JSON enum do not seem to be case sensitive right now. For example, if my Literal is Literal["cat", "dog"], the model can currently output "Cat" and "Dog", which leads to a validation error when the output is passed to the Pydantic class.
String length: The same goes for string length. The best effect is currently achieved by passing a Pydantic Field description that states the max length, but Pydantic min_length/max_length seem to have no effect, and the model can keep on talking endlessly.

Describe the solution you'd like

Checking the casing and existence of literals during generation
Restrict the length of output strings via a parameter. If it is hard to guide the model to stop, it would at least be better to brute-force an ending with an ellipsis like ... just to prevent the schema validation from failing.

Describe alternatives you've considered

Right now this can be worked around via field_validators with mode='before'. For Literal/Enum:

    @field_validator("category", mode='before')
    @classmethod
    def check_category(cls, v):
        # Normalize casing and whitespace before Pydantic checks the Literal
        return v.lower().strip()

and for strings:

    @field_validator("description", mode='before')
    @classmethod
    def check_description(cls, v):
        # Truncate to 100 characters, ending with an ellipsis
        if len(v) > 100:
            return v[: 100 - 3] + "..."
        return v

However, this is not the most elegant approach, at least for the strings.
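For anyone who wants to try the two workarounds outside of a Pydantic model, here is a minimal standalone sketch. The helper names (normalize_category, truncate_with_ellipsis) and the allowed parameter are made up for illustration and are not part of Pydantic or guidance. Unlike the validator above, the category helper also rejects values that are still unknown after normalizing, which the validator leaves to Pydantic.

```python
from typing import Iterable


def normalize_category(value: str, allowed: Iterable[str]) -> str:
    """Lower-case and strip the value, then check it against the allowed literals."""
    cleaned = value.lower().strip()
    if cleaned not in set(allowed):
        raise ValueError(f"unknown category: {value!r}")
    return cleaned


def truncate_with_ellipsis(value: str, max_length: int = 100) -> str:
    """Cut the value to at most max_length characters, ending in '...' if it was cut."""
    if len(value) <= max_length:
        return value
    if max_length <= 3:
        # No room for an ellipsis, so just cut hard.
        return value[:max_length]
    return value[: max_length - 3] + "..."


print(normalize_category("  Cat ", ["cat", "dog"]))
print(len(truncate_with_ellipsis("x" * 250)))
```

The truncation only guarantees that validation passes; the text is still cut mid-sentence, which is why a length constraint applied during generation would be preferable.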

Happy to provide further context if needed :)

riedgar-ms commented 1 week ago

The enum casing issue shouldn't be happening..... do you have a repro case?

As for limiting string length, I don't think that's something we can readily do, unfortunately.

Harsha-Nori commented 1 week ago

@riedgar-ms, I haven't been following the pydantic implementations closely, but if it's just for a string field, shouldn't we be able to support minLength/maxLength fairly trivially (even via regular expression)? I assume the semantics here are on characters, not words or something stranger.

e.g.

    from guidance import models, gen
    lm = models.Transformers("microsoft/Phi-3-medium-4k-instruct", trust_remote_code=True, device_map="mps")
    lm += "Write a letter followed by seven numbers: "
    lm2 = lm + gen(regex=r"[a-zA-Z]\d{7}", temperature=0.7)
    str(lm2) # 'Write a letter followed by seven numbers: U4719530'

Would love to better understand the complexity here and see if I can help. Thanks for raising this @h4gen!

hudson-ai commented 1 week ago

Thanks @h4gen!

I agree with @riedgar-ms that it's a bit odd that you're running into case-insensitivity issues around literals. I will take a look at that and see if I can reproduce.

@Harsha-Nori I don't think it would be hugely difficult to write a regex to implement "strings of a certain length", although we need to be really careful about escape characters, etc.

@riedgar-ms would something like this suffice for strings between length N and M? "(\\([\"\\\/bfnrt]|u[a-fA-F0-9]{4})|[^\"\\\x00-\x1F\x7F]){N,M}"
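One way to sanity-check a length-bounded pattern like this is to run it against a few JSON string bodies using Python's re and json modules. The sketch below is a raw-string adaptation of the regex above with N and M filled in. The helper names are hypothetical, and it says nothing about how guidance compiles the pattern internally. Note that each escape sequence such as \n or \u00e9 counts as one unit toward the length.

```python
import json
import re


def json_string_body_pattern(min_len: int, max_len: int) -> str:
    """Regex for the text between the quotes of a JSON string, bounded in length."""
    escape = r'\\(?:["\\/bfnrt]|u[a-fA-F0-9]{4})'  # backslash escapes allowed by JSON
    plain = r'[^"\\\x00-\x1F\x7F]'  # anything except quote, backslash, control chars
    return rf'(?:{escape}|{plain}){{{min_len},{max_len}}}'


def check(body: str, min_len: int, max_len: int) -> bool:
    """True if body matches the bounded pattern; also confirms it parses as JSON."""
    if re.fullmatch(json_string_body_pattern(min_len, max_len), body) is None:
        return False
    decoded = json.loads(f'"{body}"')  # raises if the pattern let invalid JSON through
    return isinstance(decoded, str)


# Bodies that are too short, too long, or contain an unescaped quote are rejected.
for body in ["ab", "a", "abcdef", r"a\"b", 'a"b', r"\n\t"]:
    print(repr(body), check(body, 2, 5))
```

One caveat worth keeping in mind: a character written as a surrogate pair (two \uXXXX escapes) counts as two units here, while Python's json module decodes it to a single character.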

An aside, but I was thinking that it would be really nice if we could actually allow users to pass a regex via the pattern field of the JSON schema (via pydantic this can be exposed via arguments to Field or via Annotated types). But again, the biggest issue will be escaped characters and in particular quotes if we want to guarantee that the generated JSON will always be loads-able. This will become easier if we can push through what @mmoskal and I have been working on -- in particular, his parser should be able to trivially handle intersections of grammars generated by regular expressions. May really come in handy in places like this.

riedgar-ms commented 1 week ago

Having been reminded about that bit of regex syntax, I promptly came across MAX_REPEAT in _regex.py...... We could copy that logic over.

hudson-ai commented 1 week ago

No need to copy the logic, we should be able to use the regex grammar directly :) I just need a sanity check that my regex above is correct...

hudson-ai commented 1 week ago

@h4gen I can't seem to repro the case-insensitivity. Do you have a specific/minimal example?

h4gen commented 1 week ago

Hi all,

Thanks for getting back so quickly. Unfortunately, it is unpredictable (from my point of view) when it happens, as we have set no seeds/temp/topk for the model. Sorry, I made it sound like it was always the case, which is not true. However, I am pretty sure that it started after we began using Phi3 (I think mini as well as medium). In any case, this is how we configure the model:

    class Phi3(models.LlamaCppChat):
        def get_role_start(self, role_name, **kwargs):
            return f"<|{role_name}|>\n"

        def get_role_end(self, role_name=None):
            return "<|end|>\n"

    lm = Phi3(
        model="/content/model.gguf",
        n_ctx=4096,
        n_threads=4,
        n_gpu_layers=50,
        echo=True,
    )

The Pydantic model is something like:

    class Service(BaseModel):
        """A single service provided by the company."""

        category: Union[
            Literal[
                "consulting",
                "counseling",
                "advisory",
                "coaching",
                "mentoring",
                ....
                "other",
            ],
            str,
        ] = Field(description="The category of the service.")
        name: str = Field(
            description="The name of the provided service. Max. 30 characters",
            # min_length=5,
            # max_length=30,
        )

    class Services(BaseModel):
        """A collection of services provided by a single company in English.
          Every service has it's own list item and is unique in the collection.
          """

        services: List[Service] = Field(
            ...,
            description="A list of services provided by the company. Focus on the most relevant services.",
            min_length=5,
            max_length=10,
        )

We then apply it to content retrieved from company websites, which would make no sense to post here. So we basically use the Literals for zero-shot classification. I can try to take a screenshot the next time I encounter it and post it here, but it definitely does happen!

Best, Hagen

hudson-ai commented 1 week ago

@h4gen you've annotated category here as Union[Literal[...], str] -- are you only seeing the problem when you use this pattern? If so, this could be less about case sensitivity and instead be a result of taking the unconstrained str branch of the Union.
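To illustrate the point about the Union, here is a small sketch of how a Union[Literal[...], str] annotation admits strings that the Literal alone would reject. The accepts helper is a deliberately simplified stand-in written for this example; it is not how Pydantic or guidance validate values, and the two-item Literal is borrowed from the example at the top of the thread.

```python
from typing import Literal, Union, get_args, get_origin


def accepts(annotation, value) -> bool:
    """Simplified check: does the value fit the annotation?"""
    origin = get_origin(annotation)
    if origin is Literal:
        # A Literal only admits exactly the listed values (case sensitive).
        return value in get_args(annotation)
    if origin is Union:
        # A Union admits the value if any one branch does.
        return any(accepts(branch, value) for branch in get_args(annotation))
    return isinstance(value, annotation)


Category = Literal["cat", "dog"]
Loose = Union[Category, str]

print(accepts(Category, "Cat"))  # the Literal alone rejects the wrong casing
print(accepts(Loose, "Cat"))     # the str branch of the Union lets it through
```

If this is what is happening, "Cat" would be a valid output under the schema, and dropping the str branch (or keeping only "other" as the fallback) should make the casing issue disappear.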

guidance-ai / guidance

Support for case sensitive Literal/Enum and maxLength, minLength for strings #925