guidance-ai / guidance

A guidance language for controlling large language models.
MIT License
18.14k stars · 1.01k forks

Use python's `re` module instead of `pyformlang` to build regex ASTs #854

Closed hudson-ai closed 1 month ago

hudson-ai commented 1 month ago

Removes dependency on pyformlang, instead using python's re module directly.

This ensures that we conform as tightly to python's regex semantics as possible, and it should be easy to raise good exceptions rather than failing silently if we don't support a given corner of regex (e.g. lookarounds).

In python < 3.11, the relevant imports come from sre_parse and sre_constants, while in 3.11 and greater, we have to import from re._parser and re._constants. I'm a little bit concerned about using what's not considered public API, but it seems that there has been good effort to keep this part of re relatively stable even into 3.13. I've noticed some other libraries, e.g. lark reaching into the re/sre modules in the same way.
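For concreteness, the version-gated import could look something like this (a sketch; the alias names are mine, not guidance's actual code):

```python
import sys

# The regex parser internals moved under the `re` package in Python 3.11,
# so the import location has to be version-gated.
if sys.version_info >= (3, 11):
    import re._constants as constants
    import re._parser as parser
else:
    import sre_constants as constants
    import sre_parse as parser

# Both spellings expose the same parse() entry point, which returns a
# SubPattern -- a list-like AST of (opcode, argument) nodes.
ast = parser.parse("ab")
```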

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 94.21488% with 7 lines in your changes missing coverage. Please review.

Project coverage is 60.16%. Comparing base (ad6fbb2) to head (a8b9dcb). Report is 1 commits behind head on main.

Files Patch % Lines
guidance/library/_regex.py 93.63% 7 Missing :warning:


Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #854      +/-   ##
==========================================
+ Coverage   57.17%   60.16%   +2.99%
==========================================
  Files          63       63
  Lines        4588     4622      +34
==========================================
+ Hits         2623     2781     +158
+ Misses       1965     1841     -124
```


hudson-ai commented 1 month ago

@riedgar-ms starting to feel pretty happy with this. I'd appreciate it if you could give it a once-over when you have the time :)

hudson-ai commented 1 month ago

I still need to take a closer look at what's causing some models to fail some of the tests. I do see one interesting failure so far, though. Here's a test that's failing for phi2:


```
    def test_pattern_kleene(selected_model):
        lm = selected_model
        lm += "The Lord is my"
        x = lm + gen(name="tmp", max_tokens=10)
        y = lm + gen(name="tmp", regex=".*", max_tokens=10)
>       assert y["tmp"].startswith(
            x["tmp"]
        )  # TODO: we just check startswith because exact token limits are not perfect yet...

E       AssertionError: assert False
E        +  where False = <built-in method startswith of str object at 0x7f4398106830>(' shepherd, I shall not want.\nThe Lord')
E        +    where <built-in method startswith of str object at 0x7f4398106830> = ' shepherd, I shall not want.'.startswith
```

I don't think I agree with the test, actually -- it implies that `.*` should be less restrictive than `gen` without any regex. Depending on which `re` flags we imagine are in effect (side note: should we support flags? :thinking:), `.` matches any character except a newline, so `.*` only matches strings without one. The test fails here because the unconstrained `gen` output includes a newline.
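For reference, plain stdlib behavior illustrating the point (the strings are borrowed from the failing test; nothing guidance-specific here):

```python
import re

# By default '.' matches any character *except* a newline,
# so '.*' cannot span a string containing one.
assert re.fullmatch(r".*", " shepherd, I shall not want.") is not None
assert re.fullmatch(r".*", " shepherd, I shall not want.\nThe Lord") is None

# With re.DOTALL, '.' also matches newlines.
assert re.fullmatch(r".*", " shepherd, I shall not want.\nThe Lord", re.DOTALL) is not None
```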

hudson-ai commented 1 month ago

@Harsha-Nori pinging for visibility

Harsha-Nori commented 1 month ago

Thanks Hudson!

> I still need to take a closer look at what's causing some models to fail some of the tests. I do see one interesting failure so far, though. Here's a test that's failing for phi2:
>
> ```
>     def test_pattern_kleene(selected_model):
>         lm = selected_model
>         lm += "The Lord is my"
>         x = lm + gen(name="tmp", max_tokens=10)
>         y = lm + gen(name="tmp", regex=".*", max_tokens=10)
> >       assert y["tmp"].startswith(
>             x["tmp"]
>         )  # TODO: we just check startswith because exact token limits are not perfect yet...
>
> E       AssertionError: assert False
> E        +  where False = <built-in method startswith of str object at 0x7f4398106830>(' shepherd, I shall not want.\nThe Lord')
> E        +    where <built-in method startswith of str object at 0x7f4398106830> = ' shepherd, I shall not want.'.startswith
> ```
>
> I don't think I agree with the test, actually -- it implies that `.*` should be less restrictive than `gen` without any regex. Depending on which `re` flags we imagine are in effect (side note: should we support flags? 🤔), `.` matches any character except a newline, so `.*` only matches strings without one. The test fails here because the unconstrained `gen` output includes a newline.

Great catch! I agree that this is a test misspecification :). Perhaps we can change the test to use a more restrictive Kleene star instead of `.`, and then just test for that (vs. comparing against an unconstrained gen)?

Harsha-Nori commented 1 month ago

I like the direction this PR is moving in general; standardizing on the python `re` module would be nice to align with user expectations. Happy to do a more formal code review when you think it'd be helpful :). One dimension we should consider measuring, which isn't captured in our tests, is runtime performance (I'd expect `re` to be at least as good as pyformlang).

hudson-ai commented 1 month ago

> I like the direction this PR is moving in general; standardizing on the python `re` module would be nice to align with user expectations. Happy to do a more formal code review when you think it'd be helpful :). One dimension we should consider measuring, which isn't captured in our tests, is runtime performance (I'd expect `re` to be at least as good as pyformlang).

I agree re: aligning with user expectations! Let me know when you have some time for a review -- it would definitely be helpful.

Re: testing performance -- interesting question, and definitely worth exploring. My expectation is also that re should be at least as good as pyformlang, and that's certainly the case for some special cases (e.g. large quantifier ranges do something scary and recursive in pyformlang, while this implementation just adds a bunch of optional grammars).

hudson-ai commented 1 month ago

@riedgar-ms I refactored the code used for testing json generation a bit by moving _generate_and_check over to utils. It's a pretty reusable piece of testing machinery which asserts that a Mock model can actually generate the string. I also added another check to that util to make sure that grammar.match correctly consumes the string (and appropriately captures).

hudson-ai commented 1 month ago

Oh, and I pulled the check_run_with_temperature out into its own function (which annoyingly made me have to return the underlying lm from generate_and_check...). I didn't want to add temperature support directly to regex because of how it's being called in gen -- I think there is a bug associated with wrapping a grammar in with_temperature more than once (the innermost one seems to be what's respected), so I'll take a look and open an issue if it seems appropriate.

Harsha-Nori commented 1 month ago

> I think there is a bug associated with wrapping a grammar in with_temperature more than once (the innermost one seems to be what's respected)

Perhaps not the perfect implementation, but this is intentionally by design. We were thinking about someday having an order of precedence, with inner specifications taking priority, so you could do things like (made up syntax):


```python
lm = guidance.Model("blah", temperature=0.7)  # overall default

lm += gen('hi')  # uses 0.7

with Sampler(temperature=0.5):
    lm += gen('hi2')  # uses 0.5

    lm += with_temperature(gen('hi3'), temperature=0.3)  # uses 0.3

    lm += gen('hi4')  # uses 0.5

lm += gen('hi5')  # uses 0.7
```

The idea is that we could let people use a variety of tools for setting sampling parameters that apply to different ranges of their generation/grammar. None of this is implemented yet, however, except for how temperature is represented in the grammar tree.

Harsha-Nori commented 1 month ago

Now that I think about it more though, it's true that overriding a function which sets a default sampling parameter is not ergonomic (i.e. we must pass down the kwarg, not just wrap it with with_temperature as users might expect). Worth brainstorming on this in a separate issue if you'd like to open one up.

hudson-ai commented 1 month ago

Ah, makes sense. I'll look at where it's being set inside of gen to see if I can't just make it set once and only once in the call to regex.

Speaking of which, how careful do you think we need to be about only making backwards-compatible changes to the API? For now, I will add name and temperature to regex as keyword-only arguments, but I'm just curious for future reference.

For regex in particular, it seems that it's more commonly accessed indirectly through gen, but I'm not sure what people are doing in the wild.

Harsha-Nori commented 1 month ago

We'll want to be more stringent about back-compat in the near future, but I'm ok with the idea of API changes now. My preference is to accumulate a series of changes we'd like to make, then do a single update with an appropriate announcement vs. micro-adjustments per release.

hudson-ai commented 1 month ago

> Now that I think about it more though, it's true that overriding a function which sets a default sampling parameter is not ergonomic (i.e. we must pass down the kwarg, not just wrap it with with_temperature as users might expect). Worth brainstorming on this in a separate issue if you'd like to open one up.

Just a thought, but it might be sensible to make the default be None, where we don't actually wrap in with_temperature if it's unset?

Harsha-Nori commented 1 month ago

> For regex in particular, it seems that it's more commonly accessed indirectly through gen, but I'm not sure what people are doing in the wild.

It's mostly accessed through gen today, which is good ergonomics for this heavy-hitting use case, but perhaps not very uniform/general (for example, should we also expose json and others through gen directly?). In general I like having kwargs on a smaller number of functions vs. forcing users to learn a broader function space, but it's worth thinking carefully about the design here going forward too.

Harsha-Nori commented 1 month ago

> > Now that I think about it more though, it's true that overriding a function which sets a default sampling parameter is not ergonomic (i.e. we must pass down the kwarg, not just wrap it with with_temperature as users might expect). Worth brainstorming on this in a separate issue if you'd like to open one up.
>
> Just a thought, but it might be sensible to make the default be None, where we don't actually wrap in with_temperature if it's unset?

Yes, I think this pattern generally works. I was thinking more about if/when we write functions that are perhaps opinionated about a temperature (I can see this being common with temp=0 for some things). But I think there are options here: we could write those to check for None internally, or have None be the default on with_temperature-style functions (which does a no-op and doesn't change anything), etc.
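As a minimal sketch of that None-as-no-op idea (GrammarNode and this with_temperature signature are stand-ins for illustration, not guidance's real grammar types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GrammarNode:
    # stand-in for a real grammar node; None means "no temperature set here"
    temperature: Optional[float] = None

def with_temperature(node: GrammarNode, temperature: Optional[float] = None) -> GrammarNode:
    if temperature is None:
        return node  # unset: no-op, so an inner setting is never clobbered
    return GrammarNode(temperature=temperature)

inner = with_temperature(GrammarNode(), temperature=0.3)
outer = with_temperature(inner)  # caller didn't specify -> inner 0.3 survives
assert outer.temperature == 0.3
```

With this shape, a function that is opinionated about temperature can still pass an explicit value, while wrappers that don't care simply pass nothing and leave the inner setting alone.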