simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Store prompts and system prompts longer than X characters in de-duped table #617

Open · simonw opened 6 days ago

simonw commented 6 days ago

The attachments feature makes it easy to attach video and audio from a URL or file while ensuring that even if that prompt component is used many times over it only ends up stored once in the database - in the attachments table using a sha256 hash as the ID: https://github.com/simonw/llm/blob/febbc04fb6d847bd35ce5ba0156ef010f5894564/llm/models.py#L28-L39

I find myself wanting the same thing for blocks of text: I'd like to construct e.g. a context that contains the full docs for LLM and run multiple prompts against that without storing many copies of the same exact text in the database.

One possible solution would be to re-use the attachments mechanism, with a new default plain text attachment type that all models support:

llm -m gpt-4o-mini \
  -a https://gist.githubusercontent.com/simonw/f7b251a05b834a9b60edff7e06d31572/raw/3ad2fb03d57aad7b616ec40455c355ec51d2e3db/llm-docs.md \
  'how do I embed a file?'
simonw commented 6 days ago

Pros of this approach:

Cons:

simonw commented 6 days ago

Alternative approaches:

That last option is actually reasonably elegant. It feels a bit nasty to me to have a whole bunch of very short values stored in a separate table, but it may be that most prompts and system prompts are long enough to make this worthwhile anyway.

simonw commented 6 days ago

Ideally whatever mechanism I use here should work for both system and regular prompts.

If using existing attachments we could invent artificial content types of llm/prompt and llm/system for this.

simonw commented 6 days ago

Question: in my current database how many responses have duplicates in prompts or system prompts?

simonw commented 6 days ago

Simple dupe detection query:

sqlite-utils "$(llm logs path)" "
select substr(prompt, 0, 20), length(prompt), count(*), length(prompt) * count(*) as size
from responses
group by prompt
having count(*) > 1 order by size desc" -t
(newlines within the truncated prompt prefixes are shown as \n)

substr(prompt, 0, 20)      length(prompt)    count(*)    size
-----------------------  ----------------  ----------  ------
<documents>\n<docume               332310           2  664620
src/util.rs\n---\nuse              288313           2  576626
<documents>\n<docume               112753           4  451012
jasondavies:\nshnkr                 95647           4  382588
882542F3884314B:\nc                152700           2  305400
null\n<p>Something t               147256           2  294512
diff --git a/.githu                111221           2  222442
null\n<p>Something t               109634           2  219268
zerojames:\njayniz:                102411           2  204822
Start Timestamp;End                 92772           2  185544
#!/usr/bin/env pyth                 54755           3  164265
## v2.7.3\n\n*2023-08               53962           3  161886
simonw commented 6 days ago

I think I need to prototype the approach where every prompt and system string is de-duped into a separate foreign key table, then run it against a copy of my database and compare the sizes.

simonw commented 6 days ago

Prototype:

cp "$(llm logs path)" /tmp/logs.db                                                                       
sqlite-utils extract logs.db responses system --table prompt --rename system prompt --fk-column system_id
sqlite-utils extract logs.db responses prompt                                                            
simonw commented 6 days ago

The resulting DB was even bigger for some reason. A lot of the prompts/system prompts are duplicated in the prompt_json already.

simonw commented 6 days ago

Oh, here's why it's bigger - it put an index on the new table:

CREATE UNIQUE INDEX [idx_prompt_prompt]
    ON [prompt] ([prompt]);

I really need to do the thing where the de-duped table has an indexed hash column that is used for lookups, while the foreign key from responses remains a small integer so we don't bloat the responses table.
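
A minimal sketch of that pattern (table and column names illustrative, not final):

CREATE TABLE prompt (
    id INTEGER PRIMARY KEY,  -- small integer used as the foreign key target
    hash TEXT NOT NULL,      -- sha256 of content, used for de-dupe lookups
    content TEXT NOT NULL
);
-- Index the 64-character hash rather than the full content, so lookups stay
-- fast without indexing megabytes of prompt text:
CREATE UNIQUE INDEX idx_prompt_hash ON prompt (hash);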

simonw commented 6 days ago

Dropping that index and running vacuum dropped the DB size from 170MB to 144MB, but the original is 136MB, so we still gained 8MB somehow. Probably just because we needed a bunch of extra pages for the new table.

simonw commented 6 days ago

Given these problems with the prompt_id / system_id approach I'm inclined to consider that original -a long-file.txt attachment idea instead.

simonw commented 6 days ago

Or I could have optional prompt_id and system_id columns which foreign key against either attachments or some new prompts table - but continue to use the existing prompt and system columns for short prompts.

simonw commented 6 days ago

If I insert the prompts as utf-8 binary data into that content column they look like this in Datasette:

[screenshot: the prompt text renders as an opaque binary BLOB in the Datasette interface]

Inserting them as text DOES work, but it feels really gnarly to have a BLOB column that sometimes has text in it, plus in the code Attachment.content is typed as bytes: https://github.com/simonw/llm/blob/febbc04fb6d847bd35ce5ba0156ef010f5894564/llm/models.py#L20-L26

simonw commented 6 days ago

Cleanest option here would be to have a content_text column which is a string and use that instead of content for non-binary attachment content.
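
In schema terms that's just one more nullable column on attachments (a sketch):

ALTER TABLE attachments ADD COLUMN content_text TEXT;
-- exactly one of content (BLOB) or content_text (TEXT) would be populated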

Google Gemini models actually DO accept plain text attachments already: https://ai.google.dev/gemini-api/docs/document-processing?lang=rest

Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in one of the following text data MIME types:

  • PDF - application/pdf
  • JavaScript - application/x-javascript, text/javascript
  • Python - application/x-python, text/x-python
  • TXT - text/plain
  • HTML - text/html
  • CSS - text/css
  • Markdown - text/md
  • CSV - text/csv
  • XML - text/xml
  • RTF - text/rtf

Each document page is equivalent to 258 tokens.

simonw commented 6 days ago

So maybe this ticket is actually about supporting plain text attachments and having them work across ALL models, in addition to whatever is going on here with Gemini.

I wonder if any other models out there support attachments like text/md and text/xml and suchlike?

simonw commented 6 days ago

I started playing with a llm-docs idea that could benefit from this: https://github.com/simonw/llm-docs

I currently run that like this:

curl -s "https://raw.githubusercontent.com/simonw/llm-docs/refs/heads/main/version-docs/$(llm --version | cut -d' ' -f3).txt" | \
  llm -m gpt-4o-mini 'how do I embed a binary file?'
simonw commented 5 days ago

Here's a much simpler idea. I introduce two new llm prompt arguments:

- -f/--file to read the prompt from a file
- -u/--url to read the prompt from a URL

And maybe two more that do the same for system prompts:

- --sf/--system-file
- --su/--system-url

In addition, I treat cat prompt.txt | llm as working the same as llm -f -

In all cases the text is anticipated to be longer than a regular prompt, and hence should be de-duplicated and stored in a separate table.

That table might be called prompts or inputs or maybe even contexts - it will then have two nullable foreign keys from responses for prompt_id and system_id.
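
Assuming the contexts name, a rough sketch (reusing the indexed-hash pattern from earlier; names are not final):

CREATE TABLE contexts (
    id INTEGER PRIMARY KEY,
    hash TEXT NOT NULL UNIQUE,  -- sha256 of content, for de-dupe lookups
    content TEXT NOT NULL
);
-- Two nullable foreign keys on responses - both stay null for short prompts:
ALTER TABLE responses ADD COLUMN prompt_id INTEGER REFERENCES contexts(id);
ALTER TABLE responses ADD COLUMN system_id INTEGER REFERENCES contexts(id);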

This avoids the confusion that would come from reusing the existing attachments mechanism, and the UI for it feels pretty natural to me - especially since -f and -u are not yet used for anything else.

simonw commented 5 days ago

Current database tables are:

I'm going to call the new table contexts - I think prompts would be confusing, since prompts are already stored as part of responses, and I like that the name reflects the noun "context", which is key to thinking about how to use LLMs.

simonw commented 5 days ago

How should these affect the existing templates feature?

One option: templates work exactly as they do right now, so when you do llm -f prompt.txt --save foo a foo.yaml template is created with the contents of that file.

But... this actually brings up the point that templates right now are pretty inefficient. Using a template 100 times will drop 100 copies of that template into the responses table, one for each response!

Fixing this is probably a separate issue. I could start having templates automatically use this mechanism - being treated effectively as piped in / -f prompts and system prompts. But... that won't work for templates that include replacement params.

I could make it so replacement params work against these -u and -f inputs, and store which params were used for a given prompt. I'd need to add a params optional JSON column to responses for this.

I can do that as a separate piece of work.

simonw commented 5 days ago

If I'm doing this - supporting longer prompts using --file and --url - I could implement it such that IF one of those things is provided then the prompt is automatically stashed in the contexts table.

But as I started prototyping that out I realized that an alternative rule could simply be that any prompt longer than X characters is automatically stored in that table instead.

This also isn't a decision I can get wrong. If I decide to store all prompts longer than 50 characters in contexts, and then later decide that was a bad idea, I can change the policy without breaking anything - the code I've written that knows how to read from either responses.prompt or the thing pointed to by responses.prompt_id will continue to work regardless.
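
The read path can stay agnostic about the policy with a query along these lines (a sketch, assuming the contexts schema from earlier):

SELECT
    responses.id,
    COALESCE(contexts.content, responses.prompt) AS prompt
FROM responses
LEFT JOIN contexts ON contexts.id = responses.prompt_id;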

simonw commented 5 days ago

... in which case the --file and --url options are a distraction. That feature exists already in the form of curl ... | llm and llm < prompt.txt - if the only reason I was adding them was as a signal that longer prompts should be stored independently, I don't need the new features at all once I go with a "prompts longer than X" design instead.

simonw commented 5 days ago

Here's as far as I got prototyping the --file and --url options:

diff --git a/llm/cli.py b/llm/cli.py
index 6a6fb2c..0f5f112 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -147,6 +147,16 @@ def cli():
 @click.argument("prompt", required=False)
 @click.option("-s", "--system", help="System prompt to use")
 @click.option("model_id", "-m", "--model", help="Model to use")
+@click.option("file", "-f", "--file", type=click.File(), help="Read prompt from file")
+@click.option("url", "-u", "--url", help="Read prompt from URL")
+@click.option(
+    "system_file",
+    "--sf",
+    "--system-file",
+    type=click.File(),
+    help="Read system prompt from file",
+)
+@click.option("system_url", "--su", "--system-url", help="Read system prompt from URL")
 @click.option(
     "attachments",
     "-a",
@@ -203,6 +213,10 @@ def prompt(
     prompt,
     system,
     model_id,
+    file,
+    url,
+    system_file,
+    system_url,
     attachments,
     attachment_types,
     options,
@@ -245,6 +259,16 @@ def prompt(
     def read_prompt():
         nonlocal prompt

+        prompt_bits = []
+
+        # Is there a file to read from?
+        if file:
+            prompt_bits.append(file.read())
+        if url:
+            response = httpx.get(url)
+            response.raise_for_status()
+            prompt_bits.append(response.text)
+
         # Is there extra prompt available on stdin?
         stdin_prompt = None
         if not sys.stdin.isatty():
@@ -258,10 +282,12 @@ def prompt(

         if (
             prompt is None
-            and not save
+            and file is None
+            and url is None
             and sys.stdin.isatty()
             and not attachments
             and not attachment_types
+            and not save
         ):
             # Hang waiting for input to stdin (unless --save)
             prompt = sys.stdin.read()
simonw commented 5 days ago

I think I need to change this code that populates the responses table:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/models.py#L221-L238

And code that reads from that table in two places:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/cli.py#L685-L706

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/models.py#L284-L319

simonw commented 5 days ago

I could even define a new responses_full SQL view that inflates the foreign-keyed system and regular prompts automatically, for developer convenience. It could extract the attachments as JSON too.
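
Roughly like this (a sketch - the attachments aggregation assumes the existing attachments and prompt_attachments tables and may not match their columns exactly):

CREATE VIEW responses_full AS
SELECT
    responses.*,
    COALESCE(pc.content, responses.prompt) AS full_prompt,
    COALESCE(sc.content, responses.system) AS full_system,
    (
        SELECT json_group_array(json_object(
            'id', attachments.id,
            'type', attachments.type,
            'url', attachments.url
        ))
        FROM prompt_attachments
        JOIN attachments ON attachments.id = prompt_attachments.attachment_id
        WHERE prompt_attachments.response_id = responses.id
    ) AS attachments_json
FROM responses
LEFT JOIN contexts pc ON pc.id = responses.prompt_id
LEFT JOIN contexts sc ON sc.id = responses.system_id;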

simonw commented 5 days ago

But what do I do about the prompt_json columns storing huge copies of the strings anyway?

I worked up this little system with the help of ChatGPT Code Interpreter: https://chatgpt.com/share/672d84c0-7360-8006-8333-f7e743237942

def apply_replacements(obj, replacements):
    # Recursively walk a JSON-style structure. Any string containing one of
    # the replacement keys is rewritten as {"$r": [...]} - a list of literal
    # text fragments interleaved with the short tokens from replacements.
    # Note: assumes matches for different keys don't overlap or appear out of
    # order relative to the key iteration order.
    if isinstance(obj, dict):
        return {k: apply_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [apply_replacements(item, replacements) for item in obj]
    elif isinstance(obj, str):
        replaced_parts = []
        last_index = 0
        found = False

        for key, value in replacements.items():
            index = obj.find(key)
            while index != -1:
                found = True
                if index > last_index:
                    # Preserve the literal text between matches
                    replaced_parts.append(obj[last_index:index])
                replaced_parts.append(value)
                last_index = index + len(key)
                index = obj.find(key, last_index)

        if found:
            if last_index < len(obj):
                replaced_parts.append(obj[last_index:])
            return {"$r": replaced_parts}
        else:
            return obj
    else:
        return obj

def reverse_replacements(obj, replacements):
    # Invert the mapping (token -> original string) before undoing the rewrite
    return _reverse_replacements(obj, {v: k for k, v in replacements.items()})

def _reverse_replacements(obj, replacements):
    if isinstance(obj, dict):
        if "$r" in obj:
            # Reconstruct the original string from the list
            return "".join(
                (replacements[part] if isinstance(part, int) else part)
                for part in obj["$r"]
            )
        else:
            return {k: _reverse_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [_reverse_replacements(item, replacements) for item in obj]
    else:
        return obj

So now you can do this:

from llm.utils import apply_replacements, reverse_replacements
replacements = {"This is a pretty long string at this point": 1, "And so is this one": 2}
json_object = {
    "foo": {
        "bar": {
            "baz": "this includes This is a pretty long string at this point",
            "qux": [44, "And so is this one"],
            "quux": "This has This is a pretty long string at this point and And so is this one as well"
        }
    }
}

replaced = apply_replacements(json_object, replacements)
orig = reverse_replacements(replaced, replacements)

And the replaced version looks like this:

{'foo': {'bar': {'baz': {'$r': ['this includes ', 1]},
                 'quux': {'$r': ['This has ', 1, ' and ', 2, ' as well']},
                 'qux': [44, {'$r': [2]}]}}}
simonw commented 4 days ago

The last remaining challenge is search. The system currently has a responses_fts table that indexes the prompt and response columns from responses:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/docs/logging.md#L164-L168

That design doesn't work if some of the prompts are in responses.prompt and others are in the new contexts table and related by a foreign key instead.

I could probably denormalize this, but that seems wasteful given that the whole point of this exercise is to avoid storing the same long prompts hundreds of times... if we store hundreds of copies in the FTS index table, have we really improved things?

So I likely need a separate contexts_fts table... but now I'm stuck trying to do relevance scoring across two different FTS tables, which doesn't really work.

I think I'll have to accept a reduction in relevance scoring because of this. That's OK - search isn't really a signature feature, it's more of a convenience. I don't think it matters too much if the relevance scoring is a bit out of whack.
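
A sketch of that compromise - a second FTS table and a UNION query, with no attempt to merge rank scores across the two indexes:

CREATE VIRTUAL TABLE contexts_fts USING fts5(
    content, content=contexts, content_rowid=id
);

SELECT responses.id, 'responses' AS matched_in
FROM responses
JOIN responses_fts ON responses_fts.rowid = responses.rowid
WHERE responses_fts MATCH :q
UNION
SELECT responses.id, 'contexts' AS matched_in
FROM responses
JOIN contexts_fts ON contexts_fts.rowid = responses.prompt_id
WHERE contexts_fts MATCH :q;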

simonw commented 4 days ago

Note that response is always stored separately in the responses table - it's only prompt (of the two columns in the responses_fts table) that is affected by this new table design.

simonw commented 4 days ago

I think the migrations I added broke responses_fts entirely - maybe by rewriting the rowid values when I ran .transform() to move some table columns around - such that I'll need a new migration to re-index the table entirely.

simonw commented 14 hours ago

Pushed my WIP to a branch: https://github.com/simonw/llm/commit/b2fce50aadae4a48f7eac64ab4417a806040a9e8

simonw commented 14 hours ago

I'm second-guessing the design for this feature again now.

As implemented this kicks in by magic if the prompt is longer than a certain threshold:

https://github.com/simonw/llm/blob/b2fce50aadae4a48f7eac64ab4417a806040a9e8/llm/models.py#L230-L236

This assumes that prompts and system prompts will occasionally be duplicated in their entirety.

But maybe that's not the smartest way to do this. What if instead these reusable contexts could be specified by the user directly and could even be concatenated together?

Imagine being able to do something like this:

llm -f readme.md -f docs/usage.md 'How do I install this?'

Here we are concatenating two files together. Those files could be stored in two records in contexts, which would allow them to be combined in different ways in the future while still avoiding duplicating storage for them.
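
If contexts can be combined like that, responses-to-contexts becomes a many-to-many relationship rather than a single prompt_id column. One speculative sketch of that shape:

CREATE TABLE response_contexts (
    response_id TEXT REFERENCES responses(id),
    context_id INTEGER REFERENCES contexts(id),
    "order" INTEGER,  -- position of this fragment in the concatenated prompt
    PRIMARY KEY (response_id, "order")
);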