simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Store prompts and system prompts longer than X characters in de-duped table #617

Open · simonw opened 6 days ago

simonw commented 6 days ago

The attachments feature makes it easy to attach video and audio from a URL or file while ensuring that even if that prompt component is used many times over it only ends up stored once in the database - in the attachments table using a sha256 hash as the ID: https://github.com/simonw/llm/blob/febbc04fb6d847bd35ce5ba0156ef010f5894564/llm/models.py#L28-L39

I find myself wanting the same thing for blocks of text: I'd like to construct e.g. a context that contains the full docs for LLM and run multiple prompts against that without storing many copies of the same exact text in the database.

One possible solution would be to re-use the attachments mechanism, with a new default plain text attachment type that all models support:

llm -m gpt-4o-mini \
  -a https://gist.githubusercontent.com/simonw/f7b251a05b834a9b60edff7e06d31572/raw/3ad2fb03d57aad7b616ec40455c355ec51d2e3db/llm-docs.md \
  'how do I embed a file?'
simonw commented 6 days ago

Pros of this approach:

Cons:

simonw commented 6 days ago

Alternative approaches:

That last option is actually reasonably elegant. It feels a bit nasty to me to have a whole bunch of very short values stored in a separate table, but it may be that most prompts and system prompts are long enough to make this worthwhile anyway.

simonw commented 6 days ago

Ideally whatever mechanism I use here should work for both system and regular prompts.

If using existing attachments we could invent artificial content types of llm/prompt and llm/system for this.

simonw commented 6 days ago

Question: in my current database how many responses have duplicates in prompts or system prompts?

simonw commented 6 days ago

Simple dupe detection query:

sqlite-utils "$(llm logs path)" "
select substr(prompt, 0, 20), length(prompt), count(*), length(prompt) * count(*) as size
from responses
group by prompt
having count(*) > 1 order by size desc" -t
(newlines within the truncated prompt prefixes are shown as \n)

substr(prompt, 0, 20)      length(prompt)    count(*)    size
-----------------------  ----------------  ----------  ------
<documents>\n<docume               332310           2  664620
src/util.rs\n---\nuse              288313           2  576626
<documents>\n<docume               112753           4  451012
jasondavies:\nshnkr                 95647           4  382588
882542F3884314B:\nc                152700           2  305400
null\n<p>Something t               147256           2  294512
diff --git a/.githu                111221           2  222442
null\n<p>Something t               109634           2  219268
zerojames:\njayniz:                102411           2  204822
Start Timestamp;End                 92772           2  185544
#!/usr/bin/env pyth                 54755           3  164265
## v2.7.3\n\n*2023-08               53962           3  161886
simonw commented 6 days ago

I think I need to prototype the approach where every prompt and system string is de-duped into a separate foreign key table, then run it against a copy of my database and compare the sizes.

simonw commented 6 days ago

Prototype:

cp "$(llm logs path)" /tmp/logs.db                                                                       
sqlite-utils extract logs.db responses system --table prompt --rename system prompt --fk-column system_id
sqlite-utils extract logs.db responses prompt                                                            
simonw commented 6 days ago

The resulting DB was even bigger for some reason. A lot of the prompts/system prompts are duplicated in the prompt_json already.

simonw commented 6 days ago

Oh, here's why it's bigger - it put an index on the new table:

CREATE UNIQUE INDEX [idx_prompt_prompt]
    ON [prompt] ([prompt]);

I really need to do the thing where the de-duped table has an indexed hash column that is used for lookups, while the foreign key from responses remains a small integer so we don't bloat the responses table.
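
A minimal sketch of that pattern (table and column names illustrative, not final):

CREATE TABLE prompt (
    id INTEGER PRIMARY KEY,  -- small integer used as the foreign key target
    hash TEXT NOT NULL,      -- sha256 of content, used for de-dupe lookups
    content TEXT NOT NULL
);
-- Index the 64-character hash rather than the full content, so lookups stay
-- fast without indexing megabytes of prompt text:
CREATE UNIQUE INDEX idx_prompt_hash ON prompt (hash);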

simonw commented 6 days ago

Dropping that index and running vacuum dropped the DB size from 170MB to 144MB, but the original is 136MB, so we still gained 8MB somehow. Probably just because we needed a bunch of extra pages for the new table.

simonw commented 6 days ago

Given these problems with the prompt_id / system_id approach I'm inclined to consider that original -a long-file.txt attachment idea instead.

simonw commented 6 days ago

Or I could have optional prompt_id and system_id columns which foreign key against either attachments or some new prompts table - but continue to use the existing prompt and system columns for short prompts.

simonw commented 6 days ago

If I insert the prompts as utf-8 binary data into that content column they look like this in Datasette:

[screenshot: the prompt text renders as an opaque binary BLOB in the Datasette interface]

Inserting them as text DOES work, but it feels really gnarly to have a BLOB column that sometimes has text in it, plus in the code Attachment.content is typed as bytes: https://github.com/simonw/llm/blob/febbc04fb6d847bd35ce5ba0156ef010f5894564/llm/models.py#L20-L26

simonw commented 6 days ago

Cleanest option here would be to have a content_text column which is a string and use that instead of content for non-binary attachment content.
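
In schema terms that's just one more nullable column on attachments (a sketch):

ALTER TABLE attachments ADD COLUMN content_text TEXT;
-- exactly one of content (BLOB) or content_text (TEXT) would be populated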

Google Gemini models actually DO accept plain text attachments already: https://ai.google.dev/gemini-api/docs/document-processing?lang=rest

Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in one of the following text data MIME types:

  • PDF - application/pdf
  • JavaScript - application/x-javascript, text/javascript
  • Python - application/x-python, text/x-python
  • TXT - text/plain
  • HTML - text/html
  • CSS - text/css
  • Markdown - text/md
  • CSV - text/csv
  • XML - text/xml
  • RTF - text/rtf

Each document page is equivalent to 258 tokens.

simonw commented 6 days ago

So maybe this ticket is actually about supporting plain text attachments and having them work across ALL models, in addition to whatever is going on here with Gemini.

I wonder if any other models out there support attachments like text/md and text/xml and suchlike?

simonw commented 6 days ago

I started playing with a llm-docs idea that could benefit from this: https://github.com/simonw/llm-docs

I currently run that like this:

curl -s "https://raw.githubusercontent.com/simonw/llm-docs/refs/heads/main/version-docs/$(llm --version | cut -d' ' -f3).txt" | \
  llm -m gpt-4o-mini 'how do I embed a binary file?'
simonw commented 5 days ago

Here's a much simpler idea. I introduce two new llm prompt arguments:

- -f/--file to read the prompt from a file
- -u/--url to read the prompt from a URL

And maybe two more that do the same for system prompts:

- --sf/--system-file
- --su/--system-url

In addition, I treat cat prompt.txt | llm as working the same as llm -f -

In all cases the text is anticipated to be longer than a regular prompt, and hence should be de-duplicated and stored in a separate table.

That table might be called prompts or inputs or maybe even contexts - it will then have two nullable foreign keys from responses for prompt_id and system_id.
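
Assuming the contexts name, a rough sketch (reusing the indexed-hash pattern from earlier; names are not final):

CREATE TABLE contexts (
    id INTEGER PRIMARY KEY,
    hash TEXT NOT NULL UNIQUE,  -- sha256 of content, for de-dupe lookups
    content TEXT NOT NULL
);
-- Two nullable foreign keys on responses - both stay null for short prompts:
ALTER TABLE responses ADD COLUMN prompt_id INTEGER REFERENCES contexts(id);
ALTER TABLE responses ADD COLUMN system_id INTEGER REFERENCES contexts(id);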

This avoids the confusion that would come from reusing the existing attachments mechanism, and the UI for it feels pretty natural to me - especially since -f and -u are not yet used for anything else.

simonw commented 5 days ago

Current database tables are:

I'm going to call the new table contexts - I think prompts would be confusing, since prompts are already stored as part of responses, and I like that the name reflects the noun "context", which is key to thinking about how to use LLMs.

simonw commented 5 days ago

How should these affect the existing templates feature?

One option: templates work exactly as they do right now, so when you do llm -f prompt.txt --save foo a foo.yaml template is created with the contents of that file.

But... this actually brings up the point that templates right now are pretty inefficient. Using a template 100 times will drop 100 copies of that template into the responses table, one for each response!

Fixing this is probably a separate issue. I could start having templates automatically use this mechanism - being treated effectively as piped in / -f prompts and system prompts. But... that won't work for templates that include replacement params.

I could make it so replacement params work against these -u and -f inputs, and store which params were used for a given prompt. I'd need to add a params optional JSON column to responses for this.

I can do that as a separate piece of work.

simonw commented 5 days ago

If I'm doing this - supporting longer prompts using --file and --url - I could implement it such that IF one of those things is provided then the prompt is automatically stashed in the contexts table.

But as I started prototyping that out I realized that an alternative rule could simply be that any prompt longer than X characters is automatically stored in that table instead.

This also isn't a decision I can get wrong. If I decide to store all prompts longer than 50 characters in contexts, and then later decide that was a bad idea, I can change the policy without breaking anything - the code I've written that knows how to read from either responses.prompt or the thing pointed to by responses.prompt_id will continue to work regardless.
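
The read path can stay agnostic about the policy with a query along these lines (a sketch, assuming the contexts schema from earlier):

SELECT
    responses.id,
    COALESCE(contexts.content, responses.prompt) AS prompt
FROM responses
LEFT JOIN contexts ON contexts.id = responses.prompt_id;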

simonw commented 5 days ago

... in which case the --file and --url options are a distraction. That feature exists already in the form of curl ... | llm and llm < prompt.txt - if the only reason I was adding them was as a signal that longer prompts should be stored independently, I don't need the new features at all once I go with a "prompts longer than X" design instead.

simonw commented 5 days ago

Here's as far as I got prototyping the --file and --url options:

diff --git a/llm/cli.py b/llm/cli.py
index 6a6fb2c..0f5f112 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -147,6 +147,16 @@ def cli():
 @click.argument("prompt", required=False)
 @click.option("-s", "--system", help="System prompt to use")
 @click.option("model_id", "-m", "--model", help="Model to use")
+@click.option("file", "-f", "--file", type=click.File(), help="Read prompt from file")
+@click.option("url", "-u", "--url", help="Read prompt from URL")
+@click.option(
+    "system_file",
+    "--sf",
+    "--system-file",
+    type=click.File(),
+    help="Read system prompt from file",
+)
+@click.option("system_url", "--su", "--system-url", help="Read system prompt from URL")
 @click.option(
     "attachments",
     "-a",
@@ -203,6 +213,10 @@ def prompt(
     prompt,
     system,
     model_id,
+    file,
+    url,
+    system_file,
+    system_url,
     attachments,
     attachment_types,
     options,
@@ -245,6 +259,16 @@ def prompt(
     def read_prompt():
         nonlocal prompt

+        prompt_bits = []
+
+        # Is there a file to read from?
+        if file:
+            prompt_bits.append(file.read())
+        if url:
+            response = httpx.get(url)
+            response.raise_for_status()
+            prompt_bits.append(response.text)
+
         # Is there extra prompt available on stdin?
         stdin_prompt = None
         if not sys.stdin.isatty():
@@ -258,10 +282,12 @@ def prompt(

         if (
             prompt is None
-            and not save
+            and file is None
+            and url is None
             and sys.stdin.isatty()
             and not attachments
             and not attachment_types
+            and not save
         ):
             # Hang waiting for input to stdin (unless --save)
             prompt = sys.stdin.read()
simonw commented 5 days ago

I think I need to change this code that populates the responses table:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/models.py#L221-L238

And code that reads from that table in two places:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/cli.py#L685-L706

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/models.py#L284-L319

simonw commented 5 days ago

I could even define a new responses_full SQL view that inflates the foreign-keyed system and regular prompts automatically, for developer convenience. It could extract the attachments as JSON too.
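
Roughly like this (a sketch - the attachments aggregation assumes the existing attachments and prompt_attachments tables and may not match their columns exactly):

CREATE VIEW responses_full AS
SELECT
    responses.*,
    COALESCE(pc.content, responses.prompt) AS full_prompt,
    COALESCE(sc.content, responses.system) AS full_system,
    (
        SELECT json_group_array(json_object(
            'id', attachments.id,
            'type', attachments.type,
            'url', attachments.url
        ))
        FROM prompt_attachments
        JOIN attachments ON attachments.id = prompt_attachments.attachment_id
        WHERE prompt_attachments.response_id = responses.id
    ) AS attachments_json
FROM responses
LEFT JOIN contexts pc ON pc.id = responses.prompt_id
LEFT JOIN contexts sc ON sc.id = responses.system_id;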

simonw commented 5 days ago

But what do I do about the prompt_json columns storing huge copies of the strings anyway?

I worked up this little system with the help of ChatGPT Code Interpreter: https://chatgpt.com/share/672d84c0-7360-8006-8333-f7e743237942

def apply_replacements(obj, replacements):
    # Recursively walk a JSON-style structure. Any string containing one of
    # the replacement keys is rewritten as {"$r": [...]} - a list of literal
    # text fragments interleaved with the short tokens from replacements.
    # Note: assumes matches for different keys don't overlap or appear out of
    # order relative to the key iteration order.
    if isinstance(obj, dict):
        return {k: apply_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [apply_replacements(item, replacements) for item in obj]
    elif isinstance(obj, str):
        replaced_parts = []
        last_index = 0
        found = False

        for key, value in replacements.items():
            index = obj.find(key)
            while index != -1:
                found = True
                if index > last_index:
                    # Preserve the literal text between matches
                    replaced_parts.append(obj[last_index:index])
                replaced_parts.append(value)
                last_index = index + len(key)
                index = obj.find(key, last_index)

        if found:
            if last_index < len(obj):
                replaced_parts.append(obj[last_index:])
            return {"$r": replaced_parts}
        else:
            return obj
    else:
        return obj

def reverse_replacements(obj, replacements):
    # Invert the mapping (token -> original string) before undoing the rewrite
    return _reverse_replacements(obj, {v: k for k, v in replacements.items()})

def _reverse_replacements(obj, replacements):
    if isinstance(obj, dict):
        if "$r" in obj:
            # Reconstruct the original string from the list
            return "".join(
                (replacements[part] if isinstance(part, int) else part)
                for part in obj["$r"]
            )
        else:
            return {k: _reverse_replacements(v, replacements) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [_reverse_replacements(item, replacements) for item in obj]
    else:
        return obj

So now you can do this:

from llm.utils import apply_replacements, reverse_replacements
replacements = {"This is a pretty long string at this point": 1, "And so is this one": 2}
json_object = {
    "foo": {
        "bar": {
            "baz": "this includes This is a pretty long string at this point",
            "qux": [44, "And so is this one"],
            "quux": "This has This is a pretty long string at this point and And so is this one as well"
        }
    }
}

replaced = apply_replacements(json_object, replacements)
orig = reverse_replacements(replaced, replacements)

And the replaced version looks like this:

{'foo': {'bar': {'baz': {'$r': ['this includes ', 1]},
                 'quux': {'$r': ['This has ', 1, ' and ', 2, ' as well']},
                 'qux': [44, {'$r': [2]}]}}}
simonw commented 4 days ago

The last remaining challenge is search. The system currently has a responses_fts table that indexes the prompt and response columns from responses:

https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/docs/logging.md#L164-L168

That design doesn't work if some of the prompts are in responses.prompt and others are in the new contexts table and related by a foreign key instead.

I could probably denormalize this, but that seems wasteful given that the whole point of this exercise is to avoid storing the same long prompts hundreds of times... if we store hundreds of copies in the FTS index table, have we really improved things?

So I likely need a separate contexts_fts table... but now I'm stuck trying to do relevance scoring across two different FTS tables, which doesn't really work.

I think I'll have to accept a reduction in relevance scoring because of this. That's OK - search isn't really a signature feature, it's more of a convenience. I don't think it matters too much if the relevance scoring is a bit out of whack.
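
A sketch of that compromise - a second FTS table and a UNION query, with no attempt to merge rank scores across the two indexes:

CREATE VIRTUAL TABLE contexts_fts USING fts5(
    content, content=contexts, content_rowid=id
);

SELECT responses.id, 'responses' AS matched_in
FROM responses
JOIN responses_fts ON responses_fts.rowid = responses.rowid
WHERE responses_fts MATCH :q
UNION
SELECT responses.id, 'contexts' AS matched_in
FROM responses
JOIN contexts_fts ON contexts_fts.rowid = responses.prompt_id
WHERE contexts_fts MATCH :q;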

simonw commented 4 days ago

Note that response is always stored separately in the responses table - it's only prompt (of the two columns in the responses_fts table) that is affected by this new table design.

simonw commented 4 days ago

I think the migrations I added broke responses_fts entirely - maybe by rewriting the rowid values when I ran .transform() to move some table columns around - such that I'll need a new migration to re-index the table entirely.

simonw commented 14 hours ago

Pushed my WIP to a branch: https://github.com/simonw/llm/commit/b2fce50aadae4a48f7eac64ab4417a806040a9e8

simonw commented 14 hours ago

I'm second-guessing the design for this feature again now.

As implemented this kicks in by magic if the prompt is longer than a certain threshold:

https://github.com/simonw/llm/blob/b2fce50aadae4a48f7eac64ab4417a806040a9e8/llm/models.py#L230-L236

This assumes that prompts and system prompts will occasionally be duplicated in their entirety.

But maybe that's not the smartest way to do this. What if instead these reusable contexts could be specified by the user directly and could even be concatenated together?

Imagine being able to do something like this:

llm -f readme.md -f docs/usage.md 'How do I install this?'

Here we are concatenating two files together. Those files could be stored in two records in contexts, which would allow them to be combined in different ways in the future while still avoiding duplicating storage for them.
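
If contexts can be combined like that, responses-to-contexts becomes a many-to-many relationship rather than a single prompt_id column. One speculative sketch of that shape:

CREATE TABLE response_contexts (
    response_id TEXT REFERENCES responses(id),
    context_id INTEGER REFERENCES contexts(id),
    "order" INTEGER,  -- position of this fragment in the concatenated prompt
    PRIMARY KEY (response_id, "order")
);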