hudson-ai closed this pull request 1 month ago
Attention: Patch coverage is 94.21488% with 7 lines in your changes missing coverage. Please review.
Project coverage is 60.16%. Comparing base (ad6fbb2) to head (a8b9dcb). Report is 1 commit behind head on main.
Files | Patch % | Lines
---|---|---
guidance/library/_regex.py | 93.63% | 7 Missing :warning:
@riedgar-ms starting to feel pretty happy with this. I'd appreciate it if you could give it a once-over when you have the time :)
I still need to take a closer look at what's causing some models to fail some of the tests. I do see one interesting failure so far though. Here's a test that's failing for phi2:
```
def test_pattern_kleene(selected_model):
    lm = selected_model
    lm += "The Lord is my"
    x = lm + gen(name="tmp", max_tokens=10)
    y = lm + gen(name="tmp", regex=".*", max_tokens=10)
>   assert y["tmp"].startswith(
        x["tmp"]
    )  # TODO: we just check startswith because exact token limits are not perfect yet...
E   AssertionError: assert False
E    + where False = <built-in method startswith of str object at 0x7f4398106830>(' shepherd, I shall not want.\nThe Lord')
E    + where <built-in method startswith of str object at 0x7f4398106830> = ' shepherd, I shall not want.'.startswith
```
I don't think I agree with the test actually -- it implies that `.*` should be less restrictive than `gen` without any regex. Depending on what `re` flags we're imagining are being used (side note: should we support flags? :thinking:), `.*` should match anything that doesn't contain a newline. The test fails in this case because the unrestricted `gen` includes a newline.
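For reference, stdlib `re` behaves exactly this way: `.` excludes `\n` unless `re.DOTALL` is set, which is why the unconstrained generation (which emitted a newline) can't be matched by `.*`:

```python
import re

# '.' matches any character except '\n' by default
assert re.fullmatch(".*", " shepherd, I shall not want.") is not None

# so '.*' cannot consume a string containing a newline...
assert re.fullmatch(".*", " shepherd, I shall not want.\nThe Lord") is None

# ...unless re.DOTALL makes '.' match newlines too
assert re.fullmatch(".*", " shepherd, I shall not want.\nThe Lord", flags=re.DOTALL) is not None
```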
@Harsha-Nori pinging for visibility
Thanks Hudson!
> I don't think I agree with the test actually -- it implies that `.*` should be less restrictive than `gen` without any regex. Depending on what `re` flags we're imagining are being used (side note: should we support flags? 🤔), `.*` should match anything that doesn't contain a newline. The test fails in this case because the unrestricted `gen` includes a newline.
Great catch! I agree that this is a test misspecification :). Perhaps we can change the test to use a more restrictive Kleene star instead of a `.`, and then just test for that (vs. comparison against an unconstrained gen)?
I like the direction this PR is moving in general; standardizing on the python `re` module would be nice to align with user expectations. Happy to do a more formal code review when you think it'd be helpful :). One dimension we should consider measuring, which isn't captured in our tests, is runtime performance (I'd expect `re` to be at least as good as pyformlang)?
I agree re: aligning with user expectations! Let me know when you have some time for a review -- it would definitely be helpful.
Re: testing performance -- interesting question, and definitely worth exploring. My expectation is also that `re` should be at least as good as pyformlang, and that's certainly the case for some special cases (e.g. large quantifier ranges do something scary and recursive in pyformlang, while this implementation just adds a bunch of optional grammars).
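A rough way to see the large-quantifier case on the `re` side (hedged heavily: a real comparison would run inside guidance's grammar construction; this only times stdlib parsing, which stores repetition bounds rather than expanding them):

```python
import timeit

try:
    from re import _parser as sre_parse  # Python 3.11+
except ImportError:
    import sre_parse  # Python < 3.11

# re's parser keeps the (min, max) bounds in a single MAX_REPEAT node,
# so even a huge quantifier range parses quickly
elapsed = timeit.timeit(lambda: sre_parse.parse("a{1,100000}"), number=100)
assert elapsed >= 0
print(f"100 parses of 'a{{1,100000}}': {elapsed:.4f}s")
```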
@riedgar-ms I refactored the code used for testing json generation a bit by moving `_generate_and_check` over to utils. It's a pretty reusable piece of testing machinery which asserts that a `Mock` model can actually generate the string. I also added another check to that util to make sure that `grammar.match` correctly consumes the string (and appropriately captures).
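As a self-contained illustration of that second check, here's a sketch using stdlib `re` as a stand-in for the grammar (the real util lives in guidance's test suite; these names are only illustrative):

```python
import re

def generate_and_check(pattern: str, test_string: str, name: str = "tmp"):
    """Sketch of the match-side check: the grammar must fully consume
    test_string and capture it under `name`."""
    match = re.fullmatch(f"(?P<{name}>{pattern})", test_string)
    assert match is not None, "grammar failed to consume the string"
    assert match.group(name) == test_string, "capture mismatch"
    return match

generate_and_check(r"[a-z]+ \d+", "answer 42")
```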
Oh, and I pulled the `check_run_with_temperature` out into its own function (which annoyingly made me have to return the underlying lm from `generate_and_check`...). I didn't want to add temperature support directly to `regex` due to how it's being called in `gen` -- I think there is a bug associated with wrapping a grammar in `with_temperature` more than once (the innermost one seems to be what's respected), so I'll take a look and open an issue if it seems appropriate.
> I think there is a bug associated with wrapping a grammar in with_temperature more than once (the innermost one seems to be what's respected)
Perhaps not the perfect implementation, but this is by design. We were thinking about someday having an order of precedence, with inner specifications taking priority, so you could do things like (made-up syntax):

```python
lm = guidance.Model("blah", temperature=0.7)  # overall default
lm += gen('hi')  # uses 0.7
with Sampler(temperature=0.5):
    lm += gen('hi2')  # uses 0.5
    lm += with_temperature(gen('hi3'), temperature=0.3)  # uses 0.3
    lm += gen('hi4')  # uses 0.5
lm += gen('hi5')  # uses 0.7
```

The idea is that we could let people use a variety of tools for setting sampling parameters that apply to different ranges of their generation/grammar. None of this is implemented, however, except how temperature is represented in the grammar tree.
Now that I think about it more though, it's true that overriding a function which sets a default sampling parameter is not ergonomic (i.e. we must pass down the kwarg, not just wrap it with `with_temperature` as users might expect). Worth brainstorming on this in a separate issue if you'd like to open one up.
Ah, makes sense. I'll look at where it's being set inside of gen to see if I can't just make it set once and only once in the call to regex.
Speaking of which, how careful do you think we need to be about only making backwards-compatible changes to the API? For now, I will add `name` and `temperature` to `regex` as keyword-only, but just curious for future reference.

For `regex` in particular, it seems that it's more commonly accessed indirectly through `gen`, but I'm not sure what people are doing in the wild.
We'll want to be more stringent about back-compat in the near future, but I'm ok with the idea of API changes now. My preference is to accumulate a series of changes we'd like to make, then do a single update w/ appropriate announcement vs. micro adjustments per-release
> Now that I think about it more though, it's true that overriding a function which sets a default sampling parameter is not ergonomic (i.e. we must pass down the kwarg, not just wrap it with `with_temperature` as users might expect). Worth brainstorming on this in a separate issue if you'd like to open one up.
Just a thought, but it might be sensible to make the default be `None`, where we don't actually wrap in `with_temperature` if it's unset?
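A sketch of that `None`-default idea, with `with_temperature` stubbed out since the real implementation lives in guidance's grammar module (the stand-in node shapes here are purely illustrative):

```python
def with_temperature(grammar, temperature):
    # stand-in for guidance's wrapper; here it just tags the node
    return {"node": grammar, "temperature": temperature}

def regex(pattern, *, name=None, temperature=None):
    grammar = {"pattern": pattern, "name": name}  # stand-in grammar node
    # only wrap when the caller explicitly set a temperature, so an
    # outer with_temperature (or a model-level default) still applies
    if temperature is not None:
        grammar = with_temperature(grammar, temperature)
    return grammar

assert "temperature" not in regex("a*")                  # unset: no wrapper
assert regex("a*", temperature=0.3)["temperature"] == 0.3
```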
> For regex in particular, it seems that it's more commonly accessed indirectly through gen, but I'm not sure what people are doing in the wild.
It's mostly accessed through `gen` today, which is good ergonomics for this heavy-hitting use case, but perhaps not very uniform/general (for example, should we also expose `json` and others through `gen` directly)? In general I like having kwargs on a smaller number of functions vs. forcing users to learn a broader function space, but it's worth thinking carefully about the design here going forward too.
> Now that I think about it more though, it's true that overriding a function which sets a default sampling parameter is not ergonomic (i.e. we must pass down the kwarg, not just wrap it with `with_temperature` as users might expect). Worth brainstorming on this in a separate issue if you'd like to open one up.

> Just a thought, but it might be sensible to make the default be `None`, where we don't actually wrap in `with_temperature` if it's unset?
Yes, I think this pattern generally works. I was thinking more about if/when we write functions that are opinionated about a temperature (I can see this being common with temp=0 for some things). But there are options here: we could write those to check for None internally, or have None be the default on with_temperature-style functions (making it a no-op that doesn't change anything), etc.
Removes dependency on `pyformlang`, instead using python's `re` module directly. This ensures that we conform as tightly to python's regex semantics as possible, and it should be easy to raise good exceptions rather than failing silently if we don't support a given corner of regex (e.g. lookarounds).
In python < 3.11, the relevant imports come from `sre_parse` and `sre_constants`, while in `3.11` and greater, we have to import from `re._parser` and `re._constants`. I'm a little bit concerned about using what's not considered public API, but it seems that there has been good effort to keep this part of `re` relatively stable even into `3.13`. I've noticed some other libraries, e.g. `lark`, reaching into the `re`/`sre` modules in the same way.
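The version-dependent imports described above can be handled with a guard like this (a sketch of the compatibility shim, not necessarily the PR's exact code):

```python
import sys

if sys.version_info >= (3, 11):
    # private modules in 3.11+, but kept reasonably stable
    import re._parser as sre_parse
    import re._constants as sre_constants
else:
    # public top-level modules before 3.11
    import sre_parse
    import sre_constants

# both locations expose the same parser API
parsed = sre_parse.parse("a{2,4}b")
assert len(list(parsed)) == 2  # a MAX_REPEAT node plus LITERAL 'b'
```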