simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Structured Extraction w/ Function Calls #113

Open jxnl opened 1 year ago

jxnl commented 1 year ago

Hey, I wrote a small lib around using function calls and Pydantic for extraction, and I think some of the ideas could be applied via the CLI or API. I really like the CLI interface, so I'd love to brainstorm some ways of doing the extraction via the CLI.

Here's an example from my docs.

The simplest case is defining a Pydantic model and having the LLM use it:

from pydantic import BaseModel, Field

class UserDetails(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

Suggestion:

model("gpt-4", response_model=UserDetails)

In the template you could even support some pattern like:

model: gpt-4
system: Extract out a user detail
prompt: "look at {input} and ..."
response_model:
  UserDetail:
    description: ...
    parameters:
      name:
        type: str
      ...
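
Under the hood, a template like this would need to be compiled into a JSON schema for the function-calling API. Pydantic can already generate that schema; a sketch (model_json_schema() is Pydantic v2, use .schema() on v1):

import json

from pydantic import BaseModel, Field

class UserDetail(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

# Build an OpenAI function definition from the model's JSON schema.
function_def = {
    "name": UserDetail.__name__,
    "description": UserDetail.__doc__,
    "parameters": UserDetail.model_json_schema(),
}
print(json.dumps(function_def, indent=2))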

But a more interesting use case could be configuring it to extract multiple UserDetails:

from typing import List

from pydantic import BaseModel, Field

class UserDetails(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

class MultipleDetails(BaseModel):
    batch: List[UserDetails]
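
For the multi-entity case, instructor accepts an iterable response model, which also enables streaming each entity as soon as it is parsed; a sketch assuming the same from_openai wrapper as above:

from typing import Iterable

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

class UserDetails(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

client = instructor.from_openai(OpenAI())

# Iterable[...] tells instructor to extract a stream of entities.
users = client.chat.completions.create(
    model="gpt-4",
    response_model=Iterable[UserDetails],
    stream=True,
    messages=[{"role": "user", "content": "Jason is 25 and Sarah is 30"}],
)
for user in users:
    print(user)  # each UserDetails arrives as soon as it is complete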

If this were implemented well, you could imagine the CLI tool emitting JSON Lines rather than a single JSON object. It could even be streamed!

curl -s '...' | \
  strip-tags -m | llm --extract-multiple schema.json

returns

{"name": ..., "age": ...}
{"name": ..., "age": ...}

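The emitting side is simple once each entity is a validated model; a hypothetical sketch of the CLI's inner loop (emit_jsonl is a stand-in, not a real llm function):

import sys

from pydantic import BaseModel, Field

class UserDetails(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

def emit_jsonl(users):
    """Write one JSON object per line, flushing so downstream
    pipes see each record as soon as it is extracted."""
    for user in users:
        sys.stdout.write(user.model_dump_json() + "\n")  # .json() on Pydantic v1
        sys.stdout.flush()

emit_jsonl([UserDetails(name="Jason", age=25), UserDetails(name="Sarah", age=30)])
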
cmungall commented 1 year ago

I like this idea. It would be useful to have the ability to reference an existing Pydantic model, e.g. via a package path, rather than making the user do a compilation step and copy the results into their templates.
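
A sketch of what resolving a model from a dotted package path could look like (a hypothetical helper, not an existing llm or instructor API):

import importlib

from pydantic import BaseModel

def load_model(path: str) -> type:
    """Resolve a dotted path like 'myapp.schemas.UserDetails' to a Pydantic model class."""
    module_path, _, class_name = path.rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    if not (isinstance(cls, type) and issubclass(cls, BaseModel)):
        raise TypeError(f"{path} is not a Pydantic model")
    return cls

# e.g. a hypothetical: llm extract --schema myapp.schemas.UserDetails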

@simonw mentions the possibility of adding an extract command in #66

I'm also very interested in a command like this. I was thinking of implementing it in a lib that depends on llm, but it would be even better if it could be incorporated into llm.

I was thinking of implementing this in a way that can be made independent of OpenAI functions, allowing the user to choose between strategies:

  1. Use OpenAI functions, if available.
  2. Just ask for the JSON/YAML directly, providing either in-context examples or a description of the schema.
  3. Use a recursive descent approach, as in our SPIRES algorithm, as implemented in our OntoGPT (that package does a lot more, so the idea would be to scoop out the minimal core into a separate llm-extract package).

The second is highly error-prone, but has the advantage of working with Llama 2 etc.
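
A sketch of strategy 2, putting the schema in the prompt and validating the reply with Pydantic (model-agnostic, but nothing forces the model's reply to parse, hence the error-proneness):

from pydantic import BaseModel, Field, ValidationError

class UserDetails(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

def build_prompt(text: str) -> str:
    schema = UserDetails.model_json_schema()  # .schema() on Pydantic v1
    return (
        "Extract the user details from the text below. "
        f"Reply with ONLY a JSON object matching this schema:\n{schema}\n\n"
        f"Text: {text}"
    )

def parse_reply(reply: str) -> UserDetails:
    try:
        return UserDetails.model_validate_json(reply)  # parse_raw() on v1
    except ValidationError:
        # A real implementation might retry, feeding the error back to the model.
        raise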

Perhaps the last would be best suited to a plugin?

I'm also interested in approaches to integrating in-context examples here. It's not clear how well these work for the functions approach; the library would allow the user some choice of strategy here.

cmungall commented 1 month ago

@jxnl instructor seems great, and well supported. Have you had any further thoughts about how to leverage it in combination with llm? Currently, using the two together is a little awkward, as both have their own ways of abstracting over different LLMs.
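
For what it's worth, one workaround today is to use llm only for the completion and Pydantic for the validation, sidestepping instructor's client patching entirely; a rough sketch using llm's Python API (llm.get_model / prompt / text):

import llm
from pydantic import BaseModel, Field

class UserDetails(BaseModel):
    "Correctly extracted user information"
    name: str = Field(..., description="User's full name")
    age: int

model = llm.get_model("gpt-4")
response = model.prompt(
    "Jason is 25 years old. Reply with ONLY a JSON object "
    'like {"name": "...", "age": 0}.'
)
user = UserDetails.model_validate_json(response.text())
print(user)

This loses the function-calling guarantees, but it works with any model llm supports.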