refuel-ai / autolabel

Label, clean and enrich text datasets with LLMs.
https://docs.refuel.ai/
MIT License

Guided decoding integration for autolabel #898

Closed DhruvaBansal00 closed 2 days ago

DhruvaBansal00 commented 3 weeks ago


DhruvaBansal00 commented 2 days ago

@nihit @tuxracer heads up here: for refuel models only, we have to remove all instances of additionalProperties from the supplied JSON Schema because of a restriction in lm-format-enforcer (the backend we use for guided decoding). However, since OpenAI requires us to send this parameter (https://platform.openai.com/docs/guides/structured-outputs/supported-schemas), we can't just drop it everywhere. I added methods that remove it after schema generation. Flagging this mostly because this small thing breaks inference, and we should push lm-format-enforcer to fix it so that client-side code is cleaner.

The lm-format-enforcer issue is tracked here: https://github.com/noamgat/lm-format-enforcer/issues/129
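
For anyone following along, the post-generation cleanup described above could look roughly like the sketch below. This is not the code from this PR: `strip_additional_properties` and `NAME_MAPS` are hypothetical names used only for illustration, and the sketch assumes the schema is a plain dict as produced by a typical JSON Schema generator. It returns a cleaned copy, so the original schema (with additionalProperties intact) can still be sent to OpenAI.

```python
from typing import Any

# Keywords whose values map user-chosen names to subschemas. Keys inside
# these maps are names, not schema keywords, so they must never be dropped.
NAME_MAPS = {"properties", "patternProperties", "$defs", "definitions"}


def strip_additional_properties(schema: Any) -> Any:
    """Return a copy of `schema` without any `additionalProperties` keyword.

    Hypothetical helper for illustration only. The input is not mutated.
    Known limitation: literal values under `enum`, `const`, or `default`
    are treated like subschemas, which is fine for typical generated schemas.
    """
    if isinstance(schema, dict):
        cleaned = {}
        for key, value in schema.items():
            if key == "additionalProperties":
                continue  # drop the keyword at this level
            if key in NAME_MAPS and isinstance(value, dict):
                # Keep every name; clean only the subschema behind it.
                cleaned[key] = {
                    name: strip_additional_properties(sub)
                    for name, sub in value.items()
                }
            else:
                cleaned[key] = strip_additional_properties(value)
        return cleaned
    if isinstance(schema, list):
        return [strip_additional_properties(item) for item in schema]
    return schema


# Example schema in the shape OpenAI structured outputs expects.
openai_schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string"},
        "meta": {
            "type": "object",
            "properties": {"score": {"type": "number"}},
            "required": ["score"],
            "additionalProperties": False,
        },
    },
    "required": ["label", "meta"],
    "additionalProperties": False,
}

# Cleaned copy for the refuel / lm-format-enforcer path.
refuel_schema = strip_additional_properties(openai_schema)
```

Keeping the removal in a single function applied only on the refuel path would make it easy to delete once the upstream issue is resolved.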