simonw / sqlite-utils-ask

Ask questions of your data with LLM assistance
Apache License 2.0

`-e/--examples` option for sending examples of column values #8

Closed: simonw closed this 3 months ago

simonw commented 3 months ago

Demo:

sqlite-utils ask content.db 'show me the bots' -e
{
    "sql": "SELECT * FROM users WHERE type = 'Bot'",
    "results": [
        {
            "login": "github-actions[bot]",
            "id": 41898282,
            "node_id": "MDM6Qm90NDE4OTgyODI=",
            "avatar_url": "https://avatars.githubusercontent.com/in/15368?v=4",
            "gravatar_id": "",
            "html_url": "https://github.com/apps/github-actions",
            "type": "Bot",
            "site_admin": 0,
            "name": "github-actions[bot]"
        }
    ]
}

It knew to query for type = 'Bot' because the examples included this:

{
    "users": {
        "login": [
            "simonw",
            "eyeseast",
            "bretwalker",
            "cldellow",
            "drkane"
        ],
        "node_id": [
            "MDQ6VXNlcjk1OTk=",
            "MDQ6VXNlcjI1Nzc4",
            "MDQ6VXNlcjE4MTY5OA==",
            "MDQ6VXNlcjE5MzE4NQ==",
            "MDQ6VXNlcjEwNDk5MTA="
        ],
        "gravatar_id": [],
        "html_url": [
            "https://github.com/simonw",
            "https://github.com/eyeseast",
            "https://github.com/bretwalker",
            "https://github.com/cldellow",
            "https://github.com/drkane"
        ],
        "type": [
            "User",
            "Organization",
            "Bot"
        ],
        "name": [
            "simonw",
            "eyeseast",
            "bretwalker",
            "cldellow",
            "drkane"
        ]
    }
}
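
For illustration, here is a minimal sketch of how an examples structure like the one above could be gathered with Python's standard `sqlite3` module. It is not the plugin's actual code: the function name `collect_examples`, the choice to sample only columns declared as `TEXT`, the skipping of empty strings, and the limit of five values per column are all assumptions made for this sketch.

```python
import sqlite3


def quote(identifier):
    """Quote a table or column name for safe use inside SQL."""
    return '"' + identifier.replace('"', '""') + '"'


def collect_examples(conn, limit=5):
    """Return {table: {column: [distinct sample values]}}.

    Assumptions for this sketch: only columns declared as TEXT are
    sampled, NULLs and empty strings are skipped, and at most `limit`
    distinct values are kept per column.
    """
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
        )
    ]
    examples = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
        info = conn.execute(f"PRAGMA table_info({quote(table)})").fetchall()
        text_columns = [row[1] for row in info if (row[2] or "").upper() == "TEXT"]
        table_examples = {}
        for column in text_columns:
            col = quote(column)
            rows = conn.execute(
                f"SELECT DISTINCT {col} FROM {quote(table)} "
                f"WHERE {col} IS NOT NULL AND {col} != '' LIMIT ?",
                (limit,),
            ).fetchall()
            table_examples[column] = [row[0] for row in rows]
        examples[table] = table_examples
    return examples
```

A low-cardinality column such as `type` ends up fully enumerated under this approach, which is what lets the model pick the exact value 'Bot'.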

simonw commented 3 months ago

My worry about this feature is that it can significantly increase the size of the context. I'm also not yet convinced I'm sending examples in the right way: can I do better than nested JSON like this? Something more YAML-ish would probably work better, maybe:

examples:
  users:
    type: ["User", "Organization", "Bot"]

Or even:

examples:
  users.type:
  - User
  - Organization
  - Bot
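
Both layouts are easy to generate from the nested examples structure. The following is a rough sketch, not the plugin's code: `render_nested` and `render_flat` are hypothetical names, columns with no example values are dropped as an assumption, and values are written unquoted in the flat form, so the result is YAML-ish rather than guaranteed valid YAML.

```python
import json


def render_nested(examples):
    """First layout: one inline list per column, nested under its table."""
    lines = ["examples:"]
    for table, columns in examples.items():
        non_empty = {name: values for name, values in columns.items() if values}
        if not non_empty:
            continue
        lines.append(f"  {table}:")
        for name, values in non_empty.items():
            # json.dumps renders the list inline, e.g. ["User", "Bot"]
            lines.append(f"    {name}: {json.dumps(values)}")
    return "\n".join(lines)


def render_flat(examples):
    """Second layout: a 'table.column' key followed by one value per line."""
    lines = ["examples:"]
    for table, columns in examples.items():
        for name, values in columns.items():
            if not values:
                continue
            lines.append(f"  {table}.{name}:")
            lines.extend(f"  - {value}" for value in values)
    return "\n".join(lines)
```

Comparing `len()` of each rendering against `len(json.dumps(examples, indent=4))` gives a crude character-count proxy for context size. Token counts differ between models, so it does not replace a proper eval.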

As always, the challenge here is evals - especially since this plugin works against multiple models. Which of these patterns works best? That's a really hard experiment to design.

simonw commented 3 months ago

Sending examples for every table in the schema is wasteful if we have an idea of which tables we are going to use. A couple of options:

  1. The user can optionally specify which tables they want to use, and if they do we only send examples for those tables.
  2. We round-trip once through the LLM to find out which tables it thinks are relevant, then we just send examples from those tables.
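
The two options above can be sketched as follows. This is a hedged illustration rather than the plugin's implementation: `filter_examples` and `relevant_tables` are hypothetical names, `schema` is assumed to be a dict mapping table names to their `CREATE TABLE` SQL, and `ask_model` is a hypothetical callable that takes a prompt string and returns the model's text reply.

```python
import json


def filter_examples(examples, tables):
    """Option 1: keep examples only for the tables the user named."""
    wanted = set(tables)
    return {table: cols for table, cols in examples.items() if table in wanted}


def relevant_tables(question, schema, ask_model):
    """Option 2: ask the model which tables look relevant to the question.

    Falls back to every table if the reply is not a usable JSON list,
    so a bad reply costs context size rather than correctness.
    """
    prompt = (
        "Given this SQLite schema:\n\n"
        + "\n".join(schema.values())
        + f"\n\nQuestion: {question}\n\n"
        + "Reply with only a JSON array of the table names needed to answer it."
    )
    reply = ask_model(prompt)
    try:
        names = json.loads(reply)
    except ValueError:
        return list(schema)
    if not isinstance(names, list):
        return list(schema)
    # Ignore anything that is not a known table name
    known = [name for name in names if isinstance(name, str) and name in schema]
    return known or list(schema)
```

The output of `relevant_tables` can be passed straight to `filter_examples`, so both options share the same filtering step. The round trip costs an extra model call in exchange for a smaller second prompt.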

simonw commented 3 months ago

I'm going to land the inefficient version of this for now.