guidance-ai / guidance

A guidance language for controlling large language models.
MIT License

Cache computation of JSON nodes to improve performance #995

Open · hudson-ai opened this pull request 1 month ago

hudson-ai commented 1 month ago

WIP. Seeing a 4x speedup on a large nested schema after adding caching, but we can probably squeeze more performance out of this. Not sure if the frozendict dependency is strictly necessary; just trying out a quick idea.
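At a high level, the idea is roughly the following. This is a minimal sketch with hypothetical names, not the code in this PR: memoize the schema-fragment-to-grammar-node conversion, with frozendict used to make dict-valued fragments hashable so they can serve as cache keys.

```python
# Minimal sketch of the caching idea (hypothetical names, not the real code):
# memoize the schema-fragment -> grammar-node conversion so that repeated
# subschemas in a large nested schema are only compiled once. frozendict
# makes dict-valued fragments hashable, so they can be used as cache keys.
from functools import lru_cache
from frozendict import frozendict


def freeze(value):
    """Recursively convert dicts/lists into hashable frozendicts/tuples."""
    if isinstance(value, dict):
        return frozendict({k: freeze(v) for k, v in value.items()})
    if isinstance(value, list):
        return tuple(freeze(v) for v in value)
    return value


@lru_cache(maxsize=None)
def _gen_json_node(frozen_fragment):
    # Stand-in for the real (expensive) grammar construction.
    ...


def json_grammar(schema: dict):
    return _gen_json_node(freeze(schema))
```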

codecov-commenter commented 1 month ago

:warning: Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 61.56%. Comparing base (6eb08f4) to head (84bd1e0).

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

:exclamation: There is a different number of reports uploaded between BASE (6eb08f4) and HEAD (84bd1e0). Click for more details.

HEAD has 57 fewer uploads than BASE.

| Flag | BASE (6eb08f4) | HEAD (84bd1e0) |
|------|------|------|
| | 124 | 67 |
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #995      +/-   ##
==========================================
- Coverage   70.25%   61.56%    -8.70%
==========================================
  Files          62       62
  Lines        4472     4483      +11
==========================================
- Hits         3142     2760     -382
- Misses       1330     1723     +393
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

hudson-ai commented 4 weeks ago

@riedgar-ms tests/unit/library/test_json.py::TestOneOf::test_oneOf_compound[True] is currently failing because the warning only happens on the first call (the second just hits the cache). Any thoughts?
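The failure mode is easy to reproduce in isolation: once a result is cached, the function body, including its `warnings.warn` call, is skipped on subsequent calls, so only the first parametrized invocation can observe the warning. A self-contained illustration (the names and warning text here are made up, not the actual test or library code):

```python
import warnings
from functools import lru_cache


@lru_cache(maxsize=None)
def build(schema_key):
    # The warning is raised inside the cached function, so it only fires
    # the first time a given key is seen.
    warnings.warn(f"approximating oneOf with anyOf for {schema_key}")
    return f"<node for {schema_key}>"


with warnings.catch_warnings(record=True) as first:
    warnings.simplefilter("always")
    build("compound_oneOf")

with warnings.catch_warnings(record=True) as second:
    warnings.simplefilter("always")
    build("compound_oneOf")  # cache hit: body skipped, no warning recorded

assert len(first) == 1 and len(second) == 0
```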

hudson-ai commented 3 weeks ago

An open question: is top/module-level caching a good idea? I could add a layer of encapsulation such that all of the underscore-prefixed functions only maintain a cache within a single call to json... Thoughts?

Also, feeling pretty okay about the frozendict requirement. It helps a lot with making everything nicely hashable without too much fuss.
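To make the per-call alternative concrete, the encapsulation would look roughly like this (hypothetical names; `json` stands in for the top-level entry point and the expensive construction is elided):

```python
from frozendict import frozendict


def _freeze(value):
    # Same recursive-freezing idea as in the sketch above: make fragments hashable.
    if isinstance(value, dict):
        return frozendict({k: _freeze(v) for k, v in value.items()})
    if isinstance(value, list):
        return tuple(_freeze(v) for v in value)
    return value


def _gen_json(fragment, cache):
    # The cache is passed down explicitly instead of living at module level.
    if fragment not in cache:
        cache[fragment] = ...  # expensive grammar construction would go here
    return cache[fragment]


def json(schema: dict):
    cache: dict = {}  # created per call, discarded when the call returns
    return _gen_json(_freeze(schema), cache)
```

The trade-off: shared subschemas get recompiled on every top-level call, but nothing accumulates at module scope and no state is shared across calls or threads.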

riedgar-ms commented 3 weeks ago

> An open question: is top/module-level caching a good idea? I could add a layer of encapsulation such that all of the underscore-prefixed functions only maintain a cache within a single call to json... Thoughts?
>
> Also, feeling pretty okay about the frozendict requirement. It helps a lot with making everything nicely hashable without too much fuss.

You might want to cross-check with Paul, since he's been doing the async/multithreaded stuff.

hudson-ai commented 3 weeks ago

@paulbkoch would love your input on the caching business here :)

hudson-ai commented 2 weeks ago

Currently, this PR doesn't seem to make much difference to the time spent actually constructing the grammar (I think thanks to #1007, which provided a significant boost on its own).

However, caching is still needed -- otherwise we hit the limit on the number of symbols that llguidance currently allows in a grammar. Raising that limit is a non-fix, as it has implications for the performance of mask computation.

paulbkoch commented 1 week ago

> An open question: is top/module-level caching a good idea? I could add a layer of encapsulation such that all of the underscore-prefixed functions only maintain a cache within a single call to json... Thoughts?

The one scenario I could think of where per-json-call caching might be superior is a long-running workload with constantly changing JSON schemas that are unique per call. In that case, module-level caching will keep consuming more and more memory. Given that the JSON schemas and grammars are pretty compact, it would probably take a long time to exhaust memory, though, so it's probably not the most important thing to solve. In terms of async, I can't think of an issue this caching would create, at least not after your other PR on fixing threaded grammars, @hudson-ai.

The rest of the PR looked fine to me. I can't say that I understood all the JSON schema nuances here, but the application of caching to the existing code looked good to me.

hudson-ai commented 6 days ago

> The one scenario I could think of where per-json-call caching might be superior is a long-running workload with constantly changing JSON schemas that are unique per call. In that case, module-level caching will keep consuming more and more memory. Given that the JSON schemas and grammars are pretty compact, it would probably take a long time to exhaust memory, though, so it's probably not the most important thing to solve. In terms of async, I can't think of an issue this caching would create, at least not after your other PR on fixing threaded grammars, @hudson-ai.

Agreed. If I can get my other PR on method decoration together, it would make it really easy to cache on a per-schema basis (one schema = one instance of a GenJSON class, methods are cached at the instance level...). If there are no objections, it would be great to merge this PR (and per-schema caching could go in a future one).
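For context, the instance-level shape being described would look something like this (GenJSON and the method names are hypothetical, sketched from the description above rather than from the actual method-decoration PR):

```python
from frozendict import frozendict


class GenJSON:
    """One instance per schema; its caches die with it."""

    def __init__(self, schema: dict):
        self.schema = schema
        self._node_cache: dict = {}  # per-instance, not module-level

    def node_for(self, frozen_fragment):
        if frozen_fragment not in self._node_cache:
            self._node_cache[frozen_fragment] = self._build(frozen_fragment)
        return self._node_cache[frozen_fragment]

    def _build(self, frozen_fragment):
        ...  # expensive grammar construction would go here


# Dropping the instance drops its cache, so memory cannot grow without bound
# across unrelated schemas.
gen = GenJSON({"type": "object", "properties": {"x": {"type": "integer"}}})
node = gen.node_for(frozendict({"type": "integer"}))
```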

Just want to reiterate that this introduces a new dependency on frozendict.