guidance-ai / guidance

A guidance language for controlling large language models.
MIT License
18.8k stars 1.04k forks source link

[Feature] Add extra support to JSON string schema #927

Closed riedgar-ms closed 3 months ago

riedgar-ms commented 3 months ago

Looking to add support for pattern, minLength and maxLength to strings in JSON schema. This should address half of #925

This has lead to the creation of a few new library functions to help out, and consolidation into a new _sequences.py library file. It is also hooked into _regex.py.

hudson-ai commented 3 months ago

Thanks for getting this started! Just a few comments for you as you're working on this...

  1. I'm a little bit wary about exposing direct regex generation via pattern as it will be difficult to enforce that a user's regex is properly escaped and will lead to something that is JSON loads-able... Although maybe all we have to do is document that using this functionality is "at your own risk"
  2. In the same vein, it may be impossible to enforce both a pattern and min/max length...
  3. I think you can do away with the direct select call inside _gen_json_string and just use a regex (optionally adding {min,max})... Take a look at the regex I sent on the related issue -- may be helpful
riedgar-ms commented 3 months ago

Thanks for getting this started! Just a few comments for you as you're working on this...

  1. I'm a little bit wary about exposing direct regex generation via pattern as it will be difficult to enforce that a user's regex is properly escaped and will lead to something that is JSON loads-able... Although maybe all we have to do is document that using this functionality is "at your own risk"
  2. In the same vein, it may be impossible to enforce both a pattern and min/max length...
  3. I think you can do away with the direct select call inside _gen_json_string and just use a regex (optionally adding {min,max})... Take a look at the regex I sent on the related issue -- may be helpful

For (1), I'd absolutely say that it's "at your own risk" functionality.

For (2), I'm quite confident that we cannot simultaneously enforce both a pattern and a min/maxLength. Imagine something like "pattern": "aaaa", "maxLength": 2 which is impossible to satisfy.

Of more immediate concern is that my negative test cases are failing really weirdly :-/

hudson-ai commented 3 months ago

Ah, the {b'"'} != {b'"'} isn't as weird as it looks -- it's a repr problem. Annoyingly valid_next_bytes doesn't actually return a set of bytes; it returns a set of Bytes and ByteRanges (which look like bytes in their reprs).

riedgar-ms commented 3 months ago

Ah, the {b'"'} != {b'"'} isn't as weird as it looks -- it's a repr problem. Annoyingly valid_next_bytes doesn't actually return a set of bytes; it returns a set of Bytes and ByteRanges (which look like bytes in their reprs).

That was on my "to investigate" list; was suspecting something like that.

Of more immediate concern: if you look closely at the failing tests, I believe you'll find that they ought to be passing.

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 98.30508% with 1 line in your changes missing coverage. Please review.

Project coverage is 59.18%. Comparing base (ba86754) to head (bd30725).

Files Patch % Lines
guidance/library/_json.py 96.00% 1 Missing :warning:

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #927 +/- ## ========================================== + Coverage 57.18% 59.18% +1.99% ========================================== Files 64 63 -1 Lines 4711 4733 +22 ========================================== + Hits 2694 2801 +107 + Misses 2017 1932 -85 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

hudson-ai commented 3 months ago

Looking good! One ask -- could you maybe reuse the sequence code inside of _regex.py so we don't have two implementations of it?