Open anishthite opened 2 weeks ago
That article nailed it! Really good analysis. https://www.boundaryml.com/blog/type-definition-prompting-baml
It matches my experience from the beginning. I documented it briefly here: https://github.com/jxnl/instructor/discussions/497#discussioncomment-8979998
I do by hand what that article describes (with a final twist).
In my initial prompt I include the output format:
[ Rest of the prompt ... ]
Output:
var_name1: str # Describe here what it is
var_name2: bool # Describe here what it is
Then I pass the LLM's response to LLM_with__response_model (using Instructor). That means I make 2 calls to the LLM.
The model is:
class Output(BaseModel):
    var_name1: str
    var_name2: bool
As you can see, it is almost a copy/paste of the format section of the prompt.
So the first response doesn't have to go through the whole schema and can focus on the task plus a very lightweight format.
As the article says, the LLM finds that format very easy to understand, even the # comment,
which doesn't appear in the output even though I never mention it or ask about it.
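To avoid keeping the prompt and the model in sync by hand, the lightweight format block could be rendered from a single field listing. This is a minimal sketch, not something from Instructor: `render_output_format` and the `fields` mapping are hypothetical names introduced here for illustration.

```python
from typing import Dict, Tuple


def render_output_format(fields: Dict[str, Tuple[type, str]]) -> str:
    """Render one 'name: type # description' line per field (hypothetical helper)."""
    lines = ["Output:"]
    for name, (py_type, description) in fields.items():
        # The '# description' part is a hint for the LLM, mirroring the prompt above
        lines.append(f"{name}: {py_type.__name__} # {description}")
    return "\n".join(lines)


# Assumed field listing, mirroring the Output model above
fields = {
    "var_name1": (str, "Describe here what it is"),
    "var_name2": (bool, "Describe here what it is"),
}

prompt_suffix = render_output_format(fields)
print(prompt_suffix)
```

The resulting string would be appended to the first prompt; the second call still uses the Pydantic model as the response_model.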
The docs at https://python.useinstructor.com/concepts/fields/ show how to exclude a field from the schema or add more information to it. But I haven't found a way to reduce the schema to something less verbose.
See this example:
This model
class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str
creates this Schema
{'properties': {'date_it_was_first_discovered': {'title': 'Date It Was First Discovered', 'type': 'string'},
'name_of_the_compound': {'title': 'Name Of The Compound', 'type': 'string'},
'type_of_atomic_element': {'title': 'Type Of Atomic Element', 'type': 'string'}},
'required': ['type_of_atomic_element',
'name_of_the_compound',
'date_it_was_first_discovered'],
'title': 'Output',
'type': 'object'}
After removing the totally unnecessary titles, it becomes
{'properties': {
'date_it_was_first_discovered': { 'type': 'string'},
'name_of_the_compound': {'type': 'string'},
'type_of_atomic_element': {'type': 'string'}},
'required': [
'type_of_atomic_element',
'name_of_the_compound',
'date_it_was_first_discovered'],
'title': 'Output',
'type': 'object'}
I wish I could at least remove the Titles from the Schema sent.
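For a rough feel of how much the titles cost, the two schemas above can be compared by serialized size. This is only a sketch: character count is a proxy, and the real token count depends on the model's tokenizer. `compact_size` is a hypothetical helper name.

```python
import json

# Schema as generated (copied from above)
full_schema = {
    "properties": {
        "date_it_was_first_discovered": {"title": "Date It Was First Discovered", "type": "string"},
        "name_of_the_compound": {"title": "Name Of The Compound", "type": "string"},
        "type_of_atomic_element": {"title": "Type Of Atomic Element", "type": "string"},
    },
    "required": ["type_of_atomic_element", "name_of_the_compound", "date_it_was_first_discovered"],
    "title": "Output",
    "type": "object",
}

# Same schema with the per-field titles removed (copied from above)
lean_schema = {
    "properties": {
        "date_it_was_first_discovered": {"type": "string"},
        "name_of_the_compound": {"type": "string"},
        "type_of_atomic_element": {"type": "string"},
    },
    "required": ["type_of_atomic_element", "name_of_the_compound", "date_it_was_first_discovered"],
    "title": "Output",
    "type": "object",
}


def compact_size(schema: dict) -> int:
    """Length in characters of the schema serialized without extra whitespace."""
    return len(json.dumps(schema, separators=(",", ":")))


full_chars = compact_size(full_schema)
lean_chars = compact_size(lean_schema)
saved = full_chars - lean_chars
print(f"{full_chars} -> {lean_chars} characters ({saved / full_chars:.0%} smaller)")
```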
Some may say to do this instead
class Output(BaseModel):
    type: str = Field(description='type of atomic element')
    name: str = Field(description='name of the compound')
    date: str = Field(description='date it was first discovered')
But it doesn't work (in my experiments) as well as the more 'natural language' approach. Remember that the LLM is keen to write 'what makes sense' (statistically speaking).
I'll answer my own question.
Q: Can the schema be modified to not include titles?
A: Yes
Example:
from pydantic import BaseModel

class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str

    @classmethod
    def model_json_schema(cls, **kwargs):
        schema = super().model_json_schema(**kwargs)
        schema.pop("title", None)
        for field_name, field_props in schema["properties"].items():
            field_props.pop("title", None)
        return schema

print(Output.model_json_schema())
{'properties': {'date_it_was_first_discovered': {'type': 'string'},
'name_of_the_compound': {'type': 'string'},
'type_of_atomic_element': {'type': 'string'}},
'required': ['type_of_atomic_element',
'name_of_the_compound',
'date_it_was_first_discovered'],
'type': 'object'}
Since it would be annoying to add that to every class definition, one can create a class decorator like this:
def remove_title_from_schema(cls):
    """Decorator to remove titles from model_json_schema output."""
    original_schema = cls.model_json_schema

    @classmethod
    def wrapped_schema(cls, **kwargs):
        schema = original_schema(**kwargs)
        # schema.pop("title", None)
        for field_name, field_props in schema["properties"].items():
            field_props.pop("title", None)
        return schema

    cls.model_json_schema = wrapped_schema
    return cls
Note: I commented out the removal of the title for the class object, just in case.
This makes it very easy to apply to new classes:
# Usage Example
@remove_title_from_schema
class Output(BaseModel):
    type_of_atomic_element: str
    name_of_the_compound: str
    date_it_was_first_discovered: str
# This line demonstrates calling the modified method
print(Output.model_json_schema())
Output
{'properties': {'date_it_was_first_discovered': {'type': 'string'},
'name_of_the_compound': {'type': 'string'},
'type_of_atomic_element': {'type': 'string'}},
'required': ['type_of_atomic_element',
'name_of_the_compound',
'date_it_was_first_discovered'],
'title': 'Output',
'type': 'object'}
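One caveat: the decorator only visits the top-level properties, so titles inside nested models (which Pydantic v2 places under "$defs") would remain. Below is a minimal sketch of a recursive variant that works on the plain schema dict. `strip_titles` is a hypothetical helper, not part of Pydantic or Instructor, and the sample schema is hand-written for illustration. It assumes "title" keys only appear as schema annotations or as field names, so a dict-valued default containing a "title" key would also be affected.

```python
from typing import Any

# Keys whose values map names (fields or models) to sub-schemas
NAME_MAPS = ("properties", "$defs", "definitions")


def strip_titles(node: Any, is_name_map: bool = False) -> Any:
    """Return a copy of a JSON-schema-like structure without 'title' annotations."""
    if isinstance(node, dict):
        cleaned = {}
        for key, value in node.items():
            # Inside a name map, "title" would be a field name, so keep it
            if key == "title" and not is_name_map:
                continue
            child_is_name_map = (not is_name_map) and key in NAME_MAPS
            cleaned[key] = strip_titles(value, is_name_map=child_is_name_map)
        return cleaned
    if isinstance(node, list):
        return [strip_titles(item) for item in node]
    return node


# Hand-written sample with a nested model and a field that is itself named "title"
sample = {
    "$defs": {
        "Element": {
            "properties": {"title": {"title": "Title", "type": "string"}},
            "title": "Element",
            "type": "object",
        }
    },
    "properties": {
        "element": {"$ref": "#/$defs/Element"},
        "name": {"title": "Name", "type": "string"},
    },
    "title": "Output",
    "type": "object",
}

lean = strip_titles(sample)
print(lean)
```

Unlike the decorator above, this sketch also drops the top-level title; keep it by re-adding `lean["title"] = sample["title"]` if it turns out to matter.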
Q: Does it work as before?
p="""Helium, the second element on the periodic table with the symbol He, is a key component of the first identified noble gas compound, neon clathrate hydrate. This unique structure, where helium atoms are trapped within cages of water molecules, was first discovered in 1964 through a combination of X-ray diffraction and nuclear magnetic resonance spectroscopy."""
response=call_ai_with_class(prompt=p, response_model=Output) # This is my wrapper for Instructor
print(response.output)
{'date_it_was_first_discovered': '1964',
'name_of_the_compound': 'neon clathrate hydrate',
'type_of_atomic_element': 'noble gas'}
Apparently it does.
Here are the LOGS for those interested in what is going on behind the curtain.
19:43:35 [...] Sending query:
{'max_retries': 1,
'max_tokens': 250,
'messages': [{'content': 'Helium, the second element on the periodic table with the symbol He, is a key component of '
'the first identified noble gas compound, neon clathrate hydrate. This unique structure, '
'where helium atoms are trapped within cages of water molecules, was first discovered in '
'1964 through a combination of X-ray diffraction and nuclear magnetic resonance '
'spectroscopy.',
'role': 'user'}],
'model': 'gpt-3.5-turbo',
'n': 1,
'response_model': <class '__main__.Output'>,
'temperature': 0.0}
With response_model:
{'properties': {'date_it_was_first_discovered': {'type': 'string'},
'name_of_the_compound': {'type': 'string'},
'type_of_atomic_element': {'type': 'string'}},
'required': ['type_of_atomic_element',
'name_of_the_compound',
'date_it_was_first_discovered'],
'title': 'Output',
'type': 'object'}
19:43:36 [...] Response:
{'date_it_was_first_discovered': '1964',
'name_of_the_compound': 'neon clathrate hydrate',
'type_of_atomic_element': 'noble gas'}
I don't include how many tokens are saved by this modification because:
I was reading https://www.boundaryml.com/blog/type-definition-prompting-baml, and it mentioned an optimization for lowering token counts by removing extraneous JSON schema portions
Old:
New:
Has this been tested for Instructor yet? If not, I'd be happy to try it out