Closed: simonw closed this 11 months ago
Very early initial design idea. A new hook implementation:
```python
@hookspec
def register_embedding_models(register):
    "Register additional model instances that can be used for embedding"
```

Each of those models will have a `.embed(text)` method that returns a list of floats.

Then `llm.get_embedding_model()` and `llm.get_embedding_models()` functions.
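A minimal sketch of what a plugin using that hook might look like. Everything here is hypothetical - `ExampleEmbeddingModel`, its `model_id` and its stub `embed()` are made-up illustrations, not the settled interface:

```python
class ExampleEmbeddingModel:
    """Hypothetical embedding model; the real base class is not designed yet."""

    model_id = "example-embed"

    def embed(self, text):
        # A real model would call into an embedding library here;
        # this stub just returns a fixed-size list of floats.
        return [float(len(text)), 0.0, 0.0]


# In a real plugin this function would be decorated with @llm.hookimpl
def register_embedding_models(register):
    register(ExampleEmbeddingModel())
```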
Moving this to a PR:
The method currently returns a Python list of floats.
I know from experience that I usually end up wanting to store these as a packed array of binary floats (so I can put them in a SQLite blob) - usually using this:

```python
import struct

def decode(blob):
    return struct.unpack("f" * 1536, blob)

def encode(values):
    return struct.pack("f" * 1536, *values)
```
I should figure out where this convenience can fit into LLM itself.
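As a sketch of what that convenience looks like end-to-end, here's the packed encoding round-tripping through a SQLite blob column (the table and column names are just for illustration; the length is derived from the data rather than hard-coded to 1536):

```python
import sqlite3
import struct

def encode(values):
    # Pack a list of floats into a binary blob of 4-byte floats
    return struct.pack("f" * len(values), *values)

def decode(blob):
    # Unpack a blob back into a tuple of floats (4 bytes per float)
    return struct.unpack("f" * (len(blob) // 4), blob)

db = sqlite3.connect(":memory:")
db.execute("create table embeddings (id text primary key, embedding blob)")

vector = [0.5, -1.25, 3.0]  # values chosen to be exactly representable as float32
db.execute("insert into embeddings values (?, ?)", ("doc-1", encode(vector)))

blob = db.execute(
    "select embedding from embeddings where id = ?", ("doc-1",)
).fetchone()[0]
assert list(decode(blob)) == vector
```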
I think the CLI `embed` command defaults to returning a JSON array, but maybe it could return a comma-separated list of floats or even a binary blob based on some command options.
A funny thing about embeddings is that they're only actually useful if you can make comparisons against them.
LLM already has a SQLite database for storing prompts. Maybe LLM itself also grows mechanisms for optionally storing embeddings in its own database, plus commands/functions for searching against them?
The problem there is that SQLite on its own will need to use brute-force for this, which will be slow.
But... if there was a plugin hook this could be a lot better. I could let people install plugins for FAISS and sqlite-vss and Pinecone and suchlike.
So out of the box LLM can give you slow vector search using brute force against embeddings stored in SQLite. But if you need faster lookups they should just be one plugin away.
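The brute-force approach is simple enough to sketch in pure Python - decode every stored vector and rank by cosine similarity (a real implementation might use numpy; the function names here are just for illustration):

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

def top_n(query, stored, n=10):
    # stored is a list of (id, vector) pairs; returns the n closest ids.
    # This scans every row - O(rows) per lookup, hence "slow but works".
    scored = [(cosine_similarity(query, vector), id) for id, vector in stored]
    scored.sort(reverse=True)
    return [id for score, id in scored[:n]]
```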
(I have a 5 hour plane ride today and my initial goal for this issue was to give myself the pieces I needed to hack on Retrieval Augmented Generation experiments on the plane... which I've now got, thanks to that prototype llm-embed-sentence-transformers
plugin in https://github.com/simonw/llm/pull/186#issuecomment-1694485334)
I need a design for the CLI version of this.
Embeddings on their own aren't very useful - you get back a list of floating point numbers, what are you meant to do with it?
So I think the most common operation is going to be "embed this and store it so I can refer to it later".
I'll introduce a concept of an "embedding collection" - e.g. `blog-posts` or `notes` - which you can add embedded text to along with an ID for that post within its collection.
For example:
```
cat post.txt | llm embed -m ada posts 1
```

Here we are using the `ada` embedding model (the OpenAI default one), adding to a collection called `posts` and setting an ID within that collection of `1`.
If you don't specify the collection and the ID, it can return the embedding result directly. Also, specifying `-i filename` can be the alternative to piping content to stdin:

```
llm embed -i post.txt -m ada
```

Returns:

```
[ … JSON … ]
```
Add `--bytes` to get back a binary blob of `struct.pack("f" * ...)`:

```
llm embed -i post.txt --bytes
```
Maybe `--hex` for the hex version of that? Not sure if that would ever be useful. May as well add `--base64` there too, but should confirm that it's at least shorter than `--json`.
For short strings like `"hello world"` it would be neat not to need to pass them via stdin or read them from a file. Can use `-c/--content` for that, like `python -c`:

```
llm embed -c 'content goes here'
```
If you tell it a collection name and ID, it defaults to silence - unless you add `--json` or `--bytes` (or maybe `--hex` or `--base64`):

```
llm embed -c 'content goes here' texts 1 --json
```
A notable thing about collections is that they should always use the same model, otherwise you can't compare items stored within them.
So the first time you reference a collection it will record which embedding model was used (including if it was the default, whatever that default was currently set to). Then anything else written to that collection will automatically use the same model.
Schema:

```
collections: id (int), name (str), model (str)
embeddings: collection_id (int), id (str), embedding (blob), meta (str)
```
The `meta` column there is JSON that can be used to store arbitrary additional key/value pairs against an embedding blob. I'm not sure if I'll use that for anything yet, but it feels useful.
I was going to call that `collections` table `embedding_collections` so that it couldn't be confused with a future `prompt_collections` feature I might decide to add, but since embeddings go in a separate database from prompts and responses I decided not to do that.
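In SQL, that schema might look something like this - a sketch only, since the actual DDL (likely generated by `sqlite-utils`) may differ in details like the primary key choice:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
create table collections (
    id integer primary key,
    name text,
    model text
);
create table embeddings (
    collection_id integer references collections(id),
    id text,
    embedding blob,
    meta text,  -- JSON string of arbitrary key/value pairs
    primary key (collection_id, id)
);
""")
```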
Storing these things isn't particularly interesting if you can't then do something with them.
I'm going to add one more command: `llm similar` - which returns the N most similar results to the thing you pass in.

This will return the 10 most similar posts to the post with ID 1:

```
llm similar posts 1
```
Use `-n 20` to change the number of results.

Or you can embed text and use it for the lookup straight away:

```
llm similar posts -c 'this is content'
```
No need to ever specify the model here because that can be looked up for the collection.
The big question: how do alternative vector search plugins come into play here?
The default is going to be a Python brute-force algorithm - but I very much want plugins to be able to add support for `sqlite-vss` or FAISS or Pinecone or similar.
Perhaps an index can be configured against a collection, then any changes to that collection automatically trigger an update to that index.
```
llm configure-vector-index posts faiss /tmp/faiss-index
```

Alternative names:

```
llm vector posts faiss …
llm configure-vector posts faiss …
llm index posts faiss …
```
What database should these be stored in?
The `logs.db` database feels wrong to me because of the name. I could rename it, I suppose.

I'm inclined to have an `embed.db` database in the `io.datasette.llm` directory by default, but let people pass `-d data.db` to the various embedding commands (or set a `LLM_EMBED_DB=path` environment variable) if they want to use something else.
Another important question: what should the Python API version of those `llm embed` and `llm similar` commands look like?

Maybe that becomes an `EmbeddingCollection` class which gets passed a `sqlite_utils` object to its constructor and then provides `.embed()` and `.similar()` methods.
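A rough sketch of that class. All names and signatures here are speculative, and a plain dict stands in for the `sqlite_utils` database to keep the sketch self-contained:

```python
class EmbeddingCollection:
    """Hypothetical Python API mirroring `llm embed` / `llm similar`."""

    def __init__(self, db, name, model):
        # db would be a sqlite_utils.Database in the real thing;
        # here it's just a mapping of id -> vector
        self.db = db
        self.name = name
        self.model = model  # any object with an .embed(text) method

    def embed(self, id, text):
        # Embed the text and store it under the given ID
        vector = self.model.embed(text)
        self.db[id] = vector
        return vector

    def similar(self, text, n=10):
        # Embed the query, then rank stored items by dot product
        # (real code would use cosine similarity)
        query = self.model.embed(text)

        def score(item):
            id, vector = item
            return sum(a * b for a, b in zip(query, vector))

        ranked = sorted(self.db.items(), key=score, reverse=True)
        return [id for id, vector in ranked[:n]]
```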
Ran a comparison of the formats using ChatGPT Code Interpreter. For 100 floats:
https://chat.openai.com/share/3db82122-756c-4184-b26b-09d0ca1fe0af
I'm going to do all four options: `--json`, `--blob`, `--hex` and `--base64`.

Actually I'll do `-f json/blob/hex/base64` instead so I don't need four options.
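The relative sizes of those four formats are easy to check without Code Interpreter. For 100 float32 values (illustrative only - the variable names are mine):

```python
import base64
import json
import random
import struct

values = [random.random() for _ in range(100)]

blob = struct.pack("f" * 100, *values)  # 100 floats * 4 bytes = 400 bytes
hex_str = blob.hex()                    # 2 characters per byte = 800 characters
b64 = base64.b64encode(blob).decode()   # 4 chars per 3 bytes = 536 characters
as_json = json.dumps(values)            # ~17-19 characters per float

assert len(blob) == 400
assert len(hex_str) == 800
assert len(b64) == 536
assert len(b64) < len(as_json)  # base64 is indeed shorter than JSON
```

So the blob is the most compact, base64 is the most compact text representation, and JSON is the largest but the only self-describing one.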
A thought: it would be convenient to have the option to store the text itself directly in that `embeddings` table. This would make it much easier to implement things like basic RAG without needing to join against text stored elsewhere.

Maybe a `-t` option? Or `-s/--store` for store?
No, because `-s` means `--system` elsewhere and there may be embedding models that have a concept similar to a system prompt.

I think `--store` as a long-form option with no short version. `-t` means `--template` or `--truncate` elsewhere as well.
Reviewing the existing `--help` I have some design decisions:

```
Commands:
  prompt*    Execute a prompt
  aliases    Manage model aliases
  embed      Embed text and store or return the result
  install    Install packages from PyPI into the same environment as LLM
  keys       Manage stored API keys for different models
  logs       Tools for exploring logged prompts and responses
  models     Manage available models
  openai     Commands for working directly with the OpenAI API
  plugins    List installed plugins
  templates  Manage stored prompt templates
  uninstall  Uninstall Python packages from the LLM environment
```
What should the embeddings equivalent of `llm models` be?

Some options:

- `llm embedding-models` - a bit long to type
- `llm embed-models` - shorter, but I'm a bit confused myself over when to use `embed` vs. when to use `embeddings`
- `llm emodels` - pretty non-obvious
- `llm e-models` - a bit less non-obvious but still a bit obscure
- `llm models --embeddings` - I don't like this, it should be a separate command, not an option
- `llm models embeddings` - maybe? Bit weird.

I don't like any of these much but I'm leaning towards `embed-models` at the moment.
The thing I like about `embed-models` is that it feels at least a little bit consistent with the existing `embed` and `models` commands.
Yeah, `embed-models` looks good in this list:

```
Commands:
  prompt*       Execute a prompt
  aliases       Manage model aliases
  embed         Embed text and store or return the result
  embed-models  Manage available embedding models
```
Next problem though: `aliases`.

I want embedding models to be able to have aliases too. But... the existing `aliases` command only works for regular LLM models.

And it assumes a single global namespace for the model IDs themselves. What if an embedding model wants to have the same model ID as an LLM model? How would the `llm aliases set my-alias model-id` command know which of the two should be referenced by that alias?
One option is that setting an alias for a `model_id` that is available in both contexts sets the alias in both contexts too.

This kind of makes sense, because at some point I expect you'll be able to e.g. `llm install llm-llama2-7b` and get a plugin that offers both an LLM and an embedding model - so having the same model ID for both (and setting an alias which corresponds to both) would make sense.
And aliases are currently stored like this:
```
cat ~/Library/Application\ Support/io.datasette.llm/aliases.json
```
```json
{
    "w": "mlc-chat-WizardLM-13B-V1.2-q4f16_1",
    "l2": "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "llama-2-13b": "mlc-chat-Llama-2-13b-chat-hf-q4f16_1",
    "l13b": "mlc-chat-Llama-2-13b-chat-hf-q4f16_1",
    "l7b": "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "Llama-2-7b-chat": "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "l2u": "mlc-chat-georgesung-llama2-7b-chat-uncensored-q4f16_1",
    "claude": "openrouter/anthropic/claude-2",
    "turbo": "gpt-3.5-turbo-16k",
    "llama70b": "meta-llama/Llama-2-70b-chat-hf",
    "codellama-python": "codellama-13b-python.ggmlv3.Q4_K_S"
}
```
I'm going to implement the `llm embed-models default` command as well, for seeing the default embedding model and setting it.
Adding two more commands:

```
llm embed-db path
```

Outputs the path to the `embeddings.db` database. Means you can do things like this:

```
datasette "$(llm embed-db path)"
```
And:

```
llm embed-db collections
```

Outputs a list of collections in the database.

```
llm embed-db collections --json
```
```json
[
    {
        "name": "examples",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "num_embeddings": 4
    },
    {
        "name": "phrases",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "num_embeddings": 1
    }
]
```

And:

```
llm embed-db collections
```
```
examples: sentence-transformers/all-MiniLM-L6-v2
  4 embeddings
phrases: sentence-transformers/all-MiniLM-L6-v2
  1 embedding
```
Landed that on `main`. Documentation so far is here:
I'm going to close this and do the rest of the work in separate tickets.
I want to start experimenting more with Retrieval Augmented Generation. As part of that, I want to be able to calculate embeddings against different models.
I want `llm` to grow a `llm embed` command and a Python API for that too, which have their own plugin hook.