Open simonw opened 2 weeks ago
Pros of this approach:
Cons:
llm -a URL
could point to all kinds of text files, how do we know which ones get pasted into the prompt?BLOB
for the content, but we want to store strings. We can encode those as utf-8
in a blob but that might make querying the database directly confusing.attachments
just to see what the text version of the prompt was!Alternative approaches:
responses
table can optionally use a foreign key to a new reused_text
table which works a bit like attachments do right now.response
can point to some other response and say I used the same prompt
as this one.prompt_id
and system_id
.That last option is actually reasonably elegant. It feels a bit nasty to me to have a whole bunch of very short values that are stored in a separate table, but it may be that most prompts and system prompts are long enough that this might be worthwhile anyway.
Ideally whatever mechanism I use here should work for both system
and regular prompts.
If using existing attachments we could invent artificial content types of llm/prompt
and llm/system
for this.
Question: in my current database how many responses have duplicates in prompts or system prompts?
Simple dupe detection query:
sqlite-utils "$(llm logs path)" "
select substr(prompt, 0, 20), length(prompt), count(*), length(prompt) * count(*) as size
from responses
group by prompt
having count(*) > 1 order by size desc" -t
substr(prompt, 0, 20) length(prompt) count(*) size
----------------------- ---------------- ---------- ------
<documents> 332310 2 664620
<docume
src/util.rs 288313 2 576626
---
use
<documents> 112753 4 451012
<docume
jasondavies: 95647 4 382588
shnkr
882542F3884314B: 152700 2 305400
c
null 147256 2 294512
<p>Something t
diff --git a/.githu 111221 2 222442
null 109634 2 219268
<p>Something t
zerojames: 102411 2 204822
jayniz:
Start Timestamp;End 92772 2 185544
#!/usr/bin/env pyth 54755 3 164265
## v2.7.3 53962 3 161886
*2023-08
I think I need to prototype the approach where every prompt and system string is de-duped into a separate foreign key table, then run it against a copy of my database and compare the sizes.
Prototype:
cp "$(llm logs path)" /tmp/logs.db
sqlite-utils extract logs.db responses system --table prompt --rename system prompt --fk-column system_id
sqlite-utils extract logs.db responses prompt
The resulting DB was even bigger for some reason. A lot of the prompts/system prompts are duplicated in the prompt_json
already.
Oh here's why it's bigger, it put an index on the new table:
CREATE UNIQUE INDEX [idx_prompt_prompt]
ON [prompt] ([prompt]);
I really need to do the thing where there's a hash column which has an index on it and is used for lookups, but the foreign key remains an integer so we don't bloat the responses
table.
Dropping that index and running vacuum dropped the DB size from 170MB to 144MB, but the original is 136MB to we still gained 8MB somehow. Probably just because we needed a bunch of extra pages for the new table.
Given these problems with the prompt_id
/ system_id
approach I'm inclined to consider that original -a long-file.txt
attachment idea instead.
Or I could have optional prompt_id
and system_id
columns which foreign key against either attachments
or some new prompts
table - but continue to use the existing prompt
and system
columns for short prompts.
If I insert the prompts as utf-8 binary data into that content
column they look like this in Datasette:
Inserting them as text DOES work, but it feels really gnarly to have a BLOB column that sometimes has text in it, plus in the code Attachment.content
is typed as a bytes: https://github.com/simonw/llm/blob/febbc04fb6d847bd35ce5ba0156ef010f5894564/llm/models.py#L20-L26
Cleanest option here would be to have a content_text
column which is a string and use that instead of content
for non-binary attachment content.
Google Gemini models actually DO accept plain text attachments already: https://ai.google.dev/gemini-api/docs/document-processing?lang=rest
Gemini 1.5 Pro and 1.5 Flash support a maximum of 3,600 document pages. Document pages must be in one of the following text data MIME types:
- PDF -
application/pdf
- JavaScript -
application/x-javascript
,text/javascript
- Python -
application/x-python
,text/x-python
- TXT -
text/plain
- HTML -
text/html
- CSS -
text/css
- Markdown -
text/md
- CSV -
text/csv
- XML -
text/xml
- RTF -
text/rtf
Each document page is equivalent to 258 tokens.
So maybe this ticket is actually about supporting plain text attachments and having them work across ALL models, in addition to whatever is going on here with Gemini.
I wonder if any other models out there support attachments like text/md
and text/xml
and suchlike?
I started playing with a llm-docs
idea that could benefit from this: https://github.com/simonw/llm-docs
I currently run that like this:
curl -s "https://raw.githubusercontent.com/simonw/llm-docs/refs/heads/main/version-docs/$(llm --version | cut -d' ' -f3).txt" | \
llm -m gpt-4o-mini 'how do I embed a binary file?'
Here's a much simpler idea. I introduce two new llm prompt
arguments:
llm -f prompt.txt
- runs the prompt from that file (alias --file
)llm -u https://...
- runs the prompt fetched from that URL (alias --url
)And maybe two more that do the same for system prompts:
llm --sf prompt.txt
- alias --system-file
llm --su https://...
- alias --system-url
In addition, I treat cat prompt.txt | llm
as working the same as llm -f -
In all cases the text is anticipated to be longer than a regular prompt, and hence should be de-duplicated and stored in a separate table.
That table might be called prompts
or inputs
or maybe even contexts
- it will then have two nullable foreign keys from responses
for prompt_id
and system_id
.
This avoids the confusion that would come from reusing the existing attachments mechanism, and the UI for it feels pretty natural to me - especially since -f
and -u
are not yet used for anything else.
Current database tables are:
conversations
responses
attachments
prompt_attachments
I'm going to call the new table contexts
- I think prompts
is confusing as those are already stored as part of responses
and I like that it reflects the noun context
which is key to thinking about how to use LLMs.
How should these affect the existing templates
feature?
One option: templates work exactly as they do right now, so when you do llm -f prompt.txt --save foo
a foo.yaml
template is created with the contents of that file.
But... this actually brings up the point that templates right now are pretty inefficient. Using a template 100 times will drop 100 copies of that template into the responses
table, one for each response!
Fixing this is probably a separate issue. I could start having templates automatically use this mechanism - being treated effectively as piped in / -f
prompts and system prompts. But... that won't work for templates that include replacement params.
I could make it so replacement params work against these -u
and -f
inputs, and store which params were used for a given prompt. I'd need to add a params
optional JSON column to responses
for this.
I can do that as a separate piece of work.
If I'm doing this - supporting longer prompts using --file
and --url
- I could implement it such that IF one of those things is provided then the prompt is automatically stashed in the contexts
table.
But as I started prototyping that out I realized that an alternative rule could simple be that any prompts over the length of X characters are automatically stored in that table instead.
This also isn't a question I can get wrong. If I decide to make all prompts longer than 50 characters stored in contexts
instead, and then later decide that was a bad idea, I can change the policy without breaking anything - the code I've written that knows how to read from either responses.prompt
or the thing pointed to by responses.prompt_id
will continue to work regardless.
... in which case the --file
and --url
options are a distraction. That feature exists already in the form of curl ... | llm
and llm < prompt.txt
- if the only reason I was adding those is as a signal that longer prompts should be stored independently I don't need the new features at all if I go with a "prompts longer than X" design instead.
Here's as far as I got prototyping the --file
and --url
options:
diff --git a/llm/cli.py b/llm/cli.py
index 6a6fb2c..0f5f112 100644
--- a/llm/cli.py
+++ b/llm/cli.py
@@ -147,6 +147,16 @@ def cli():
@click.argument("prompt", required=False)
@click.option("-s", "--system", help="System prompt to use")
@click.option("model_id", "-m", "--model", help="Model to use")
+@click.option("file", "-f", "--file", type=click.File(), help="Read prompt from file")
+@click.option("url", "-u", "--url", help="Read prompt from URL")
+@click.option(
+ "system_file",
+ "--sf",
+ "--system-file",
+ type=click.File(),
+ help="Read system prompt from file",
+)
+@click.option("system_url", "--su", "--system-url", help="Read system prompt from URL")
@click.option(
"attachments",
"-a",
@@ -203,6 +213,10 @@ def prompt(
prompt,
system,
model_id,
+ file,
+ url,
+ system_file,
+ system_url,
attachments,
attachment_types,
options,
@@ -245,6 +259,16 @@ def prompt(
def read_prompt():
nonlocal prompt
+ prompt_bits = []
+
+ # Is there a file to read from?
+ if file:
+ prompt_bits.append(file.read())
+ if url:
+ response = httpx.get(url)
+ response.raise_for_status()
+ prompt_bits.append(response.text)
+
# Is there extra prompt available on stdin?
stdin_prompt = None
if not sys.stdin.isatty():
@@ -258,10 +282,12 @@ def prompt(
if (
prompt is None
- and not save
+ and file is None
+ and url is None
and sys.stdin.isatty()
and not attachments
and not attachment_types
+ and not save
):
# Hang waiting for input to stdin (unless --save)
prompt = sys.stdin.read()
I think I need to change this code that populates the responses
table:
https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/models.py#L221-L238
And code that reads from that table in two places:
https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/cli.py#L685-L706
https://github.com/simonw/llm/blob/5d1d723d4beb546eab4deb8bb8f740b2fe20e065/llm/models.py#L284-L319
I could even define a new responses_full
SQL view that inflates the foreign key system and regular prompts automatically, for developer convenience. It could extract out the attachments as JSON too.
But what do I do about the prompt_json
columns storing huge copies of the strings anyway?
I worked up this little system with the help of ChatGPT Code Interpreter: https://chatgpt.com/share/672d84c0-7360-8006-8333-f7e743237942
def apply_replacements(obj, replacements):
if isinstance(obj, dict):
return {k: apply_replacements(v, replacements) for k, v in obj.items()}
elif isinstance(obj, list):
return [apply_replacements(item, replacements) for item in obj]
elif isinstance(obj, str):
replaced_parts = []
last_index = 0
found = False
for key, value in replacements.items():
index = obj.find(key)
while index != -1:
found = True
if index > last_index:
replaced_parts.append(obj[last_index:index])
replaced_parts.append(value)
last_index = index + len(key)
index = obj.find(key, last_index)
if found:
if last_index < len(obj):
replaced_parts.append(obj[last_index:])
return {"$r": replaced_parts}
else:
return obj
else:
return obj
def reverse_replacements(obj, replacements):
return _reverse_replacements(obj, {v: k for k, v in replacements.items()})
def _reverse_replacements(obj, replacements):
if isinstance(obj, dict):
if "$r" in obj:
# Reconstruct the original string from the list
return "".join(
(replacements[part] if isinstance(part, int) else part)
for part in obj["$r"]
)
else:
return {k: _reverse_replacements(v, replacements) for k, v in obj.items()}
elif isinstance(obj, list):
return [_reverse_replacements(item, replacements) for item in obj]
else:
return obj
So now you can do this:
from llm.utils import apply_replacements, reverse_replacements
replacements = {"This is a pretty long string at this point": 1, "And so is this one": 2}
json_object = {
"foo": {
"bar": {
"baz": "this includes This is a pretty long string at this point",
"qux": [44, "And so is this one"],
"quux": "This has This is a pretty long string at this point and And so is this one as well"
}
}
}
replaced = apply_replacements(json_object, replacements)
orig = reverse_replacements(replaced, replacements)
And the replaced
version looks like this:
{'foo': {'bar': {'baz': {'$r': ['this includes ', 1]},
'quux': {'$r': ['This has ', 1, ' and ', 2, ' as well']},
'qux': [44, {'$r': [2]}]}}}
The last remaining challenge is search. The system currently has a responses_fts
table that indexes the prompt
and response
columns from responses
:
That design doesn't work if some of the prompts are in responses.prompt
and others are in the new contexts
table and related by a foreign key instead.
I could probably denormalize this, but that seems like it could be wasteful since the whole point of this exercise is to avoid storing the same long prompts hundreds of times... but if we store hundreds of copies in the FTS index table have we really improved things?
So I likely need a separate contexts_fts
table... but now I'm stuck trying to do relevance scoring across two different FTS tables, which doesn't really work.
I think I'll have to accept a reduction in relevance scoring because of this. That's OK - the search feature isn't really a signature big deal, it's more of a convenience. I don't think it matters too much if the relevance scoring is a bit out of whack.
Note that response
is always stored separately in the responses
table - it's only prompt
(of the two columns in the responses_fts
table) that is affected by this new table design.
I think the migrations I added broke response_fts
entirely, maybe by rewriting the rowid
values when I ran .transform()
to move some table columns around such that I need a new migration to re-index the table entirely.
Pushed my WIP to a branch: https://github.com/simonw/llm/commit/b2fce50aadae4a48f7eac64ab4417a806040a9e8
I'm second-guessing the design for this feature again now.
As implemented this kicks in by magic if the prompt is longer than a certain threshold:
https://github.com/simonw/llm/blob/b2fce50aadae4a48f7eac64ab4417a806040a9e8/llm/models.py#L230-L236
This assumes that prompts and system prompts will be entirely duplicated occasionally.
But maybe that's not the smartest way to do this. What if instead these reusable contexts could be specified by the user directly and could even be concatenated together?
Imagine being able to do something like this:
llm -f readme.md -f docs/usage.md 'How do I install this?'
Here we are concatenating two files together. Those files could be stored in two records in contexts
, which would allow them to be combined in different ways in the future while still avoiding duplicating storage for them.
Relevant: https://twitter.com/alexalbert__/status/1857457290917589509
You can now access all of our docs concatenated as a single plain text file that can be fed in to any LLM.
Here's the url route: http://docs.anthropic.com/llms-full.txt
... if I DO go the -f
route I think it should accept both file paths and URLs, because that's what the -a
option does already:
llm -m claude-3.5-sonnet -f http://docs.anthropic.com/llms-full.txt 'how do I request a stream with curl?'
Note that http://docs.anthropic.com/llms-full.txt is a redirect to the https version, so following redirects is important - I just tried this:
curl 'http://docs.anthropic.com/llms-full.txt' | llm -m claude-3-5-haiku-latest 'a haiku about this document'
And got:
Here's a haiku about this HTTP 301 Moved Permanently response:
Cloudflare whispers soft Redirected web page falls Elsewhere, silently
The haiku captures the essence of the 301 status code - a gentle, permanent redirection signaled by Cloudflare, with the page moving to a new location in a quiet, understated manner.
I am going to call these fragments. The CLI option will be -f/--fragment
and will accept a file Path or a URL or a -
.
I am going to expose these in the Python API as well - as a prompt.fragments
list of strings.
Open question: how about letting fragments be specified by ID as well? That way users could store them deliberately in the database.
There is definitely something interesting about being able to run a prompt that references multiple fragments. Imagine something like this:
llm -f datasette -f llm "ideas for Datasette plugins powered by LLM"
Where datasette
is a fragment ID for a copy of the Datasette docs and llm
is a fragment ID for the LLM docs.
How to tell the difference between -f datasette
meaning the Datasette stored fragment and meaning a file on disk called datasette
? Fragment ID wins, if you want the file instead use -f ./datasette
instead to clarify.
The only reason to expose fragments in the Python prompt.fragments = [str]
API is the code for persisting a prompt and response to a database knows to treat them specially.
This design implies the need for a family of CLI commands for managing fragments and their aliases. Maybe:
# list (truncated) stored fragments
llm fragments list
# show one (takes alias, ID or hash)
llm fragments show ALIAS
# search them
llm fragments list -q LLM
# add one with an alias
llm fragments set llm ./llm.txt
Can you remove fragments? Doing so could break existing DB records that rely on them - but maybe there's private information in them that you didn't mean to save?
I thank you can remove them, and doing so will cause those affected records to show that the fragments have been removed while still retuning the logged response.
# Remove stored fragment
llm fragments remove llm
So the fragments table has:
id
- numeric ID used in M2M tableshash
- string hash used for dedupingcontent
string contentalias
- null allowed unique indexed alias stringI might have a created date time on there too. And maybe even a source URL or filename.
One thing this new design does not take into account is system prompts.
Maybe --sf X/--system-fragment X
?
Would also need a prompt.system_fragments
list of strings - and that's a bit weird because of the thing where some models don't let you change system prompt half way through a conversation (a problem we have with prompt.system
already so at least it's not a new problem).
Made a start on this here - I've implemented the schema and the code such that prompts are aware of it, but I'm not yet persisting the fragments to the new database tables.
https://github.com/simonw/llm/blob/6c355c1e534f27a8a3b4d2930fe6e5a99c34c29a/llm/models.py#L90-L133
A slight oddity with this design for aliases: if you set the alias of a piece of content to one thing, then set a different alias for the same byte-for-byte piece of content from elsewhere, the original alias will be forgotten.
That feels confusing, and doesn't match with how aliases work for models where one model can have multiple aliases.
Potential solutions:
llm fragments set ...
on content that has an alias according to its hash alreadyalias
being a nullable unique column on the fragments
table it should instead be a separate fragment_aliases
table which can assign multiple aliases to a single hashThat 3rd option is the least surprising in my opinion. It also removes the slightly weird unique-but-nullable alias
column from the fragments
table.
db["fragment_aliases"].create(
{
"alias": str,
"fragment_id": int,
},
foreign_keys=("fragment_id", "fragments", "id"),
pk=("alias",),
)
The primary key IS the alias, since you can't have multiple aliases with the same name.
I think removing an alias should use garbage collection rules: if the aliased fragment is referenced by at least one other alias OR is referenced in the prompt_fragments
or system_fragments
table it is left alone, if no references to it remain then it is deleted.
I'm not going to bother implementing garbage collection for the moment.
This is now mostly working. The prompt_json
column is a problem though - it's storing a duplicate of the whole prompt when the point of this exercise is to cut down on how much duplicated junk we save in that table.
i see you just touched this but for what its worth a few days ago i tried to add a text file as an attachment assuming it would work and was surprised it didnt. was the mental model i had. so this is good.
The attachments feature makes it easy to attach video and audio from a URL or file while ensuring that even if that prompt component is used many times over it only ends up stored once in the database - in the
attachments
table using asha256
hash as the ID: https://github.com/simonw/llm/blob/febbc04fb6d847bd35ce5ba0156ef010f5894564/llm/models.py#L28-L39I find myself wanting the same thing for blocks of text: I'd like to construct e.g. a context that contains the full docs for LLM and run multiple prompts against that without storing many copies of the same exact text in the database.
One possible solution would be to re-use the attachments mechanism, with a new default plain text attachment type that all models support: