Open h4gen opened 1 week ago
The enum casing issue shouldn't be happening... Do you have a repro case?
As for limiting string length, I don't think that's something we can readily do, unfortunately.
@riedgar-ms, I haven't been following the pydantic implementations closely, but if it's just for a string field, shouldn't we be able to support minLength/maxLength fairly trivially (even via regular expression)? I assume the semantics here are on characters, not words or something stranger
e.g.
from guidance import models, gen
lm = models.Transformers("microsoft/Phi-3-medium-4k-instruct", trust_remote_code=True, device_map="mps")
lm += "Write a letter followed by seven numbers: "
lm2 = lm + gen(regex=r"[a-zA-Z]\d{7}", temperature=0.7)
str(lm2) # 'Write a letter followed by seven numbers: U4719530'
Would love to better understand the complexity here and see if I can help. Thanks for raising this @h4gen!
Thanks @h4gen!
I agree with @riedgar-ms that it's a bit odd that you're running into case-insensitivity issues around literals. I will take a look at that and see if I can reproduce.
@Harsha-Nori I don't think it would be hugely difficult to write a regex to implement "strings of a certain length", although we need to be really careful about escape characters, etc.
@riedgar-ms would something like this suffice for strings between length N and M?
"(\\([\"\\\/bfnrt]|u[a-fA-F0-9]{4})|[^\"\\\x00-\x1F\x7F]){N,M}"
An aside, but I was thinking that it would be really nice if we could actually allow users to pass a regex via the pattern field of the JSON schema (via pydantic this can be exposed via arguments to Field or via Annotated types). But again, the biggest issue will be escaped characters, and in particular quotes, if we want to guarantee that the generated JSON will always be loads-able. This will become easier if we can push through what @mmoskal and I have been working on. In particular, his parser should be able to trivially handle intersections of grammars generated by regular expressions. May really come in handy in places like this.
Having been reminded about that bit of regex syntax, I promptly came across MAX_REPEAT in _regex.py... We could copy that logic over.
No need to copy the logic, we should be able to use the regex grammar directly :) I just need a sanity check that my regex above is correct...
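One way to sanity-check a pattern of this shape is to run it through Python's re module next to json.loads. The sketch below is a reading of the intent of the regex above, not the exact expression: the helper names are hypothetical, the bounds N and M become parameters, and the pattern is written as a raw Python string so the backslashes are unambiguous.

```python
import json
import re


def bounded_json_string_pattern(min_len: int, max_len: int) -> str:
    """Build a regex for a JSON string literal with min_len..max_len units."""
    # One escaped unit: a backslash plus a simple escape character, or \uXXXX.
    escape = r'\\(?:["\\/bfnrt]|u[a-fA-F0-9]{4})'
    # One unescaped unit: anything except quote, backslash, control chars, DEL.
    plain = r'[^"\\\x00-\x1f\x7f]'
    return rf'"(?:{escape}|{plain}){{{min_len},{max_len}}}"'


def is_bounded_json_string(text: str, min_len: int, max_len: int) -> bool:
    """True if text matches the pattern; also confirms json.loads accepts it."""
    pattern = bounded_json_string_pattern(min_len, max_len)
    if re.fullmatch(pattern, text) is None:
        return False
    json.loads(text)  # raises ValueError if the pattern let bad JSON through
    return True
```

Note that the quantifier counts escape sequences as single units, so a surrogate pair written as two \u escapes counts as two here. Whether that matches the intended minLength/maxLength semantics is a separate question.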
@h4gen I can't seem to repro the case-insensitivity. Do you have a specific/minimal example?
Hi all,
Thanks for getting back so quickly. Unfortunately, from my point of view it is unpredictable when it happens, as we have set no seed/temperature/top-k for the model. Sorry, I made it sound like it was always the case, which is not true. However, what I can say is that I am pretty sure it started after we began using Phi3 (I think mini as well as medium). In any case, this is how we configure the model:
from guidance import models

class Phi3(models.LlamaCppChat):
    def get_role_start(self, role_name, **kwargs):
        return f"<|{role_name}|>\n"

    def get_role_end(self, role_name=None):
        return "<|end|>\n"

lm = Phi3(
    model="/content/model.gguf",
    n_ctx=4096,
    n_threads=4,
    n_gpu_layers=50,
    echo=True,
)
The Pydantic model is something like:
from typing import List, Literal, Union
from pydantic import BaseModel, Field

class Service(BaseModel):
    """A single service provided by the company."""
    category: Union[
        Literal[
            "consulting",
            "counseling",
            "advisory",
            "coaching",
            "mentoring",
            ....
            "other",
        ],
        str,
    ] = Field(description="The category of the service.")
    name: str = Field(
        description="The name of the provided service. Max. 30 characters",
        # min_length=5,
        # max_length=30,
    )

class Services(BaseModel):
    """A collection of services provided by a single company in English.
    Every service has it's own list item and is unique in the collection.
    """
    services: List[Service] = Field(
        ...,
        description="A list of services provided by the company. Focus on the most relevant services.",
        min_length=5,
        max_length=10,
    )
We then apply it to content retrieved from company websites, which would make no sense to post here. So we basically use the Literals for zero-shot classification. I can try to take a screenshot the next time I encounter it and post it here, but it definitely does happen!
Best, Hagen
@h4gen you've annotated category here as Union[Literal[...], str]. Are you only seeing the problem when you use this pattern? If so, this could be less about case sensitivity and instead be a result of taking the unconstrained str branch of the Union.
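To illustrate the point about the Union, here is a simplified sketch using plain regular expressions instead of guidance's actual grammar machinery. The two branch patterns and all names are assumptions made for illustration only. The idea is that output constrained to "either branch" can take the str branch, which accepts a differently cased value that the Literal branch alone would reject.

```python
import re

# Simplified stand-ins for the two branches of Union[Literal["cat", "dog"], str].
LITERAL_BRANCH = r'"(?:cat|dog)"'
STR_BRANCH = r'"[^"\\]*"'  # any JSON string without escapes (simplification)

literal_only = re.compile(LITERAL_BRANCH)
literal_or_str = re.compile(f"(?:{LITERAL_BRANCH}|{STR_BRANCH})")


def accepted(pattern: re.Pattern, text: str) -> bool:
    """True if the whole text is allowed by the given constraint."""
    return pattern.fullmatch(text) is not None


# The Literal-only constraint rejects the miscased value...
print(accepted(literal_only, '"Cat"'))
# ...but the union admits it through the unconstrained str branch.
print(accepted(literal_or_str, '"Cat"'))
```

If the problem disappears when category is annotated with the Literal alone, that would point at the str branch and not at case handling in the enum.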
First of all, many thanks to @hudson-ai for implementing the awesome Pydantic functionality. I really love what it is doing and it works really well so far. While working with it, we noticed some things that I would like to suggest as features.
Is your feature request related to a problem? Please describe.
Literal and Enums
Literals or Enums which are translated to JSON enum from Pydantic do not seem to be case sensitive right now. I.e., if my Literal is
Literal["cat", "dog"]
the model can currently output "Cat" and "Dog", which will lead to a validation error when the output is passed to the Pydantic class.
String length
Same for string length. The best effect is currently produced by passing a Pydantic Field description that states the max length, but Pydantic min_length/max_length seem to have no effect and the model can keep on talking endlessly.
Describe the solution you'd like
...
just to prevent the Schema validation from failing.
Describe alternatives you've considered
Right now this can be worked around via field_validators with mode='before'. For Literal/Enum: and for strings:
However, it is not the most elegant approach, at least for the strings.
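As a rough sketch of this kind of workaround (not the original snippets, and with hypothetical helper names), the normalization that a mode='before' validator would perform could look like this in plain Python:

```python
from typing import Any, Sequence


def normalize_literal(value: Any, allowed: Sequence[str]) -> Any:
    """Map a string onto the allowed literal it matches case-insensitively.

    Values with no case-insensitive match are returned unchanged, so the
    regular validation step can still reject them.
    """
    if isinstance(value, str):
        lookup = {option.lower(): option for option in allowed}
        return lookup.get(value.strip().lower(), value)
    return value


def clamp_length(value: Any, max_length: int) -> Any:
    """Cut a string down to max_length characters; leave other values alone."""
    if isinstance(value, str) and len(value) > max_length:
        return value[:max_length]
    return value
```

Inside a Pydantic model, helpers like these would be called from a field_validator declared with mode='before', so the raw value is adjusted before validation runs. Truncation silently drops text, which is part of why this is not an elegant fix for strings.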
Happy to provide further context if needed :)