jxnl / instructor

structured outputs for llms
https://python.useinstructor.com/
MIT License
6.52k stars 514 forks

Lower cost of tokens via type definitions #768

Open anishthite opened 2 weeks ago

anishthite commented 2 weeks ago

I was reading https://www.boundaryml.com/blog/type-definition-prompting-baml, and it mentioned an optimization for lowering token counts by removing extraneous JSON schema portions

Old:

[Screenshot 2024-06-18 at 00 42 30]

New:

[Screenshot 2024-06-18 at 00 42 39]

Has this been tested for Instructor yet? If not, I'd be happy to try it out

Mr-Ruben commented 2 weeks ago

That post nailed it! Really good analysis. https://www.boundaryml.com/blog/type-definition-prompting-baml

It has been my experience since the beginning. I documented it briefly here https://github.com/jxnl/instructor/discussions/497#discussioncomment-8979998

I do by hand what that post describes (with a final twist).

In my initial prompt I include the output format:

[ Rest of the prompt ... ]

Output:
var name1: str #  Describe here what it is
var name2: bool #  Describe here what it is 

And then I pass the LLM's response to LLM_with__response_model (using Instructor). That means I make 2 calls to the LLM.

The Model is

class Output(BaseModel):
  var_name1: str
  var_name2: bool

As you can see, it is almost a copy/paste of the part in the prompt.

So the first call doesn't have to go through the whole Schema, and can just focus on the Task + a very lightweight format.

As the post says, the LLM finds that format very easy to understand, even the # comments, which (without my mentioning or asking) don't appear in the output.
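To avoid maintaining the lightweight prompt format and the Pydantic model separately, the format can also be derived from the model itself. This is just a sketch, not part of Instructor; `lightweight_format` is a hypothetical helper, assuming Pydantic v2's `model_fields`:

```python
from pydantic import BaseModel, Field

def lightweight_format(model: type[BaseModel]) -> str:
    """Render a model as 'name: type  # description' lines for a prompt."""
    lines = []
    for name, field in model.model_fields.items():
        type_name = getattr(field.annotation, "__name__", str(field.annotation))
        comment = f"  # {field.description}" if field.description else ""
        lines.append(f"{name}: {type_name}{comment}")
    return "\n".join(lines)

class Output(BaseModel):
    var_name1: str = Field(description="Describe here what it is")
    var_name2: bool = Field(description="Describe here what it is")

print(lightweight_format(Output))
# var_name1: str  # Describe here what it is
# var_name2: bool  # Describe here what it is
```

That way the prompt text and the response_model stay in sync by construction.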


Here https://python.useinstructor.com/concepts/fields/ one can exclude a field from the Schema or add more information to it. But I haven't found a way to reduce the Schema to something less verbose.
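(As an aside on the exclusion part: in Pydantic v2 a field can be dropped from the generated schema with `SkipJsonSchema`. A sketch, not from the docs page above; it addresses exclusion, not the verbosity itself:)

```python
from pydantic import BaseModel
from pydantic.json_schema import SkipJsonSchema

class Output(BaseModel):
    name_of_the_compound: str
    internal_note: SkipJsonSchema[str] = ""  # omitted from the JSON schema

# Only 'name_of_the_compound' shows up in the properties
print(Output.model_json_schema()["properties"].keys())
```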

See this example:

This model

class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str

creates this Schema

{'properties': {'date_it_was_first_discovered': {'title': 'Date It Was First Discovered', 'type': 'string'},
                'name_of_the_compound': {'title': 'Name Of The Compound', 'type': 'string'},
                'type_of_atomic_element': {'title': 'Type Of Atomic Element',  'type': 'string'}},
 'required': ['type_of_atomic_element',
              'name_of_the_compound',
              'date_it_was_first_discovered'],
 'title': 'Output',
 'type': 'object'}

Which, removing the totally unnecessary property Titles, becomes

{'properties': {
                'date_it_was_first_discovered': { 'type': 'string'},
                'name_of_the_compound': {'type': 'string'},
                'type_of_atomic_element': {'type': 'string'}},
 'required': [
              'type_of_atomic_element',
              'name_of_the_compound',
              'date_it_was_first_discovered'],
 'title': 'Output',
 'type': 'object'}

I wish I could at least remove the Titles from the Schema sent.

Some may say to do this instead

class Output(BaseModel):
    type: str = Field(description='type of atomic element')
    name: str = Field(description='name of the compound')
    date: str = Field(description='date it was first discovered')

But in my experiments it doesn't work as well as the more 'natural language' approach. Remember that the LLM is keen to write 'what makes sense' (statistically speaking).

Mr-Ruben commented 2 weeks ago

I'll answer my own question

Q: Can the schema be modified to not include titles?

A: Yes

Example:

from pydantic import BaseModel

class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str

    @classmethod
    def model_json_schema(cls, **kwargs):
        schema = super().model_json_schema(**kwargs)
        schema.pop("title", None)
        for field_name, field_props in schema["properties"].items():
            field_props.pop("title", None)
        return schema

print(Output.model_json_schema())
{'properties': {'date_it_was_first_discovered': {'type': 'string'},
                'name_of_the_compound': {'type': 'string'},
                'type_of_atomic_element': {'type': 'string'}},
 'required': ['type_of_atomic_element',
              'name_of_the_compound',
              'date_it_was_first_discovered'],
 'type': 'object'}

As it would be annoying to have to add that to every class definition, one can create a class decorator like

def remove_title_from_schema(cls):
    """Class decorator to remove property titles from model_json_schema output."""
    original_schema = cls.model_json_schema

    @classmethod
    def wrapped_schema(cls, **kwargs):
        schema = original_schema(**kwargs)
        # schema.pop("title", None)
        for field_props in schema["properties"].values():
            field_props.pop("title", None)
        return schema

    cls.model_json_schema = wrapped_schema
    return cls

Note: I commented out the removal of the title for the class object, just in case.
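An alternative that avoids patching the class, assuming Pydantic v2: `model_config` accepts a `json_schema_extra` callable, which is handed the generated schema dict to mutate in place. A sketch:

```python
from pydantic import BaseModel, ConfigDict

def strip_property_titles(schema: dict) -> None:
    # Mutates the generated schema in place; the top-level title is kept.
    for props in schema.get("properties", {}).values():
        props.pop("title", None)

class Output(BaseModel):
    model_config = ConfigDict(json_schema_extra=strip_property_titles)

    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str

print(Output.model_json_schema())
```

This could also live in a shared base class so every model inherits the behavior.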

Which makes it very easy to apply to new classes

# Usage Example
@remove_title_from_schema
class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str

# This line demonstrates calling the modified method
print(Output.model_json_schema())

Output

{'properties': {'date_it_was_first_discovered': {'type': 'string'},
                'name_of_the_compound': {'type': 'string'},
                'type_of_atomic_element': {'type': 'string'}},
 'required': ['type_of_atomic_element',
              'name_of_the_compound',
              'date_it_was_first_discovered'],
 'title': 'Output',
 'type': 'object'}

Q: Does it work as before?


p="""Helium, the second element on the periodic table with the symbol He, is a key component of the first identified noble gas compound, neon clathrate hydrate. This unique structure, where helium atoms are trapped within cages of water molecules, was first discovered in 1964 through a combination of X-ray diffraction and nuclear magnetic resonance spectroscopy."""

response=call_ai_with_class(prompt=p, response_model=Output)  # This is my wrapper for Instructor

print(response.output)
{'date_it_was_first_discovered': '1964',
 'name_of_the_compound': 'neon clathrate hydrate',
 'type_of_atomic_element': 'noble gas'}

Apparently it does.


Here are the LOGS for those interested in what is going on behind the curtain.

19:43:35 [...] Sending query: 
{'max_retries': 1,
 'max_tokens': 250,
 'messages': [{'content': 'Helium, the second element on the periodic table with the symbol He, is a key component of '
                          'the first identified noble gas compound, neon clathrate hydrate. This unique structure, '
                          'where helium atoms are trapped within cages of water molecules, was first discovered in '
                          '1964 through a combination of X-ray diffraction and nuclear magnetic resonance '
                          'spectroscopy.',
               'role': 'user'}],
 'model': 'gpt-3.5-turbo',
 'n': 1,
 'response_model': <class '__main__.Output'>,
 'temperature': 0.0}
With response_model:
{'properties': {'date_it_was_first_discovered': {'type': 'string'},
                'name_of_the_compound': {'type': 'string'},
                'type_of_atomic_element': {'type': 'string'}},
 'required': ['type_of_atomic_element',
              'name_of_the_compound',
              'date_it_was_first_discovered'],
 'title': 'Output',
 'type': 'object'}

19:43:36 [...] Response: 
{'date_it_was_first_discovered': '1964',
 'name_of_the_compound': 'neon clathrate hydrate',
 'type_of_atomic_element': 'noble gas'}

I don't include how many tokens are saved by this modification because:

  1. It is obvious it depends on the class definition
  2. I am not so much interested in 'saving tokens' as I am in 'reducing the amount of unnecessary garbage that is sent to the LLM' so that it doesn't have to be read and processed. Less garbage, less distractions, more energy to focus on the important stuff.
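For those who do want to measure, a quick character-count comparison of the two schemas is easy (a sketch; actual token savings depend on the tokenizer and on the class definition):

```python
import copy
import json
from pydantic import BaseModel

class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str

full = Output.model_json_schema()
slim = copy.deepcopy(full)
for props in slim["properties"].values():
    props.pop("title", None)

# The stripped schema is strictly shorter than the original
print(len(json.dumps(full)), "->", len(json.dumps(slim)))
```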