simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0

Initial abstraction for running embeddings #185

Closed simonw closed 11 months ago

simonw commented 11 months ago

I want to start experimenting more with Retrieval Augmented Generation. As part of that, I want to be able to calculate embeddings against different models.

I want llm to grow an llm embed command and a Python API for that too, backed by their own plugin hook.

simonw commented 11 months ago

Very early initial design idea. A new hook implementation:

@hookspec
def register_embedding_models(register):
    "Return a list of model instances that can be used for embedding"

Each of those models will have a .embed(text) method that returns a list of floats.

Then llm.get_embedding_model() and llm.get_embedding_models() functions.
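
Roughly, a plugin might satisfy that hook like this (the class and method names here are just illustrative, not a final API):

import llm

class MyEmbeddingModel:
    model_id = "my-embedding-model"

    def embed(self, text):
        # Call whatever library or API produces the vector;
        # the contract is that this returns a plain list of floats.
        return [0.0] * 384  # placeholder

@llm.hookimpl
def register_embedding_models(register):
    register(MyEmbeddingModel())

With something like that installed, llm.get_embedding_model("my-embedding-model") would return the instance and llm.get_embedding_models() would list it.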

simonw commented 11 months ago

Moving this to a PR:

simonw commented 11 months ago

The method currently returns a Python list of floats.

I know from experience that I usually end up wanting to store these as a packed array of binary floats (so I can put them in a SQLite blob) - usually using this:

import struct

def decode(blob):
    # 1536 = dimension of OpenAI's ada-002 embedding vectors
    return struct.unpack("f" * 1536, blob)

def encode(values):
    # Pack the floats into a compact binary blob for a SQLite column
    return struct.pack("f" * 1536, *values)

From https://github.com/simonw/openai-to-sqlite/blob/84d6bbd67379a054347dc9a7f84e6c7e225bfb67/openai_to_sqlite/cli.py#L362-L367

I should figure out where this convenience can fit into LLM itself.
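
For example, a round trip through SQLite using those helpers might look like this (a minimal sketch - the embeddings table, database filename and IDs are made up for illustration):

import sqlite3

db = sqlite3.connect("embeddings.db")
db.execute("create table if not exists embeddings (id text primary key, embedding blob)")

vector = [0.5] * 1536  # an ada-002-sized embedding
db.execute(
    "insert or replace into embeddings (id, embedding) values (?, ?)",
    ("post-1", encode(vector)),  # encode() as defined above
)
blob = db.execute(
    "select embedding from embeddings where id = ?", ("post-1",)
).fetchone()[0]
assert list(decode(blob)) == vector  # decode() as defined above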

simonw commented 11 months ago

I think the CLI embed command defaults to returning a JSON array, but maybe it could return a comma-separated list of floats or even a binary blob based on some command options.

simonw commented 11 months ago

A funny thing about embeddings is that they're only actually useful if you can make comparisons against them.

LLM already has a SQLite database for storing prompts. Maybe LLM itself also grows mechanisms for optionally storing embeddings in its own database, plus commands/functions for searching against them?

The problem there is that SQLite on its own will need to use brute-force for this, which will be slow.

But... if there was a plugin hook this could be a lot better. I could let people install plugins for FAISS and sqlite-vss and Pinecone and suchlike.

So out of the box LLM can give you slow vector search using brute force against embeddings stored in SQLite. But if you need faster lookups they should just be one plugin away.
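
The brute-force version really is only a few lines of Python - a dependency-free cosine similarity sketch:

import math

def cosine_similarity(a, b):
    # a and b are plain lists of floats of equal length
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

Scanning every stored embedding with a function like that is O(n) per query, which is exactly the slow path a FAISS, sqlite-vss or Pinecone plugin could replace.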

simonw commented 11 months ago

(I have a 5 hour plane ride today and my initial goal for this issue was to give myself the pieces I needed to hack on Retrieval Augmented Generation experiments on the plane... which I've now got, thanks to that prototype llm-embed-sentence-transformers plugin in https://github.com/simonw/llm/pull/186#issuecomment-1694485334)

simonw commented 11 months ago

I need a design for the CLI version of this.

Embeddings on their own aren't very useful - you get back a list of floating point numbers, and what are you meant to do with it?

So I think the most common operation is going to be "embed this and store it so I can refer to it later".

I'll introduce a concept of an "embedding collection" - e.g. blog-posts or notes - which you can add embedded text to along with an ID for that post within its collection.

For example:

cat post.txt | llm embed -m ada posts 1

Here we are using the ada embedding model (the OpenAI default one), adding to a collection called posts and setting an ID within that collection of 1.

If you don't specify the collection and the ID, it can return the embedding result directly. Also, specifying -i filename can be an alternative to piping content to stdin:

llm embed -i post.txt -m ada

Returns:

[ … JSON … ]

Add --bytes to get back a binary blob of struct.pack("f" * ...):

llm embed -i post.txt --bytes

Maybe --hex for the hex version of that? Not sure if that would ever be useful. May as well add --base64 there too, but should confirm that it's at least shorter than --json.
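
A quick way to sanity-check those relative sizes (a throwaway sketch, assuming a 1536-dimension vector of random floats):

import base64, json, random, struct

vector = [random.random() for _ in range(1536)]
blob = struct.pack("f" * len(vector), *vector)

print(len(json.dumps(vector)))      # JSON text
print(len(blob))                    # raw blob: 4 bytes per float
print(len(blob.hex()))              # hex: 2 characters per byte
print(len(base64.b64encode(blob)))  # base64: roughly 4/3 of the raw size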

For short strings like "hello world" it would be neat not to need to pass them to stdin or read them from a file. Can use -c/--content for that, like python -c:

llm embed -c 'content goes here'

If you tell it a collection name and ID, it defaults to silence - unless you add --json or --bytes (or maybe --hex or --base64):

llm embed -c 'content goes here' texts 1 --json

Embedding collections

A notable thing about collections is that they should always use the same model, otherwise you can't compare items stored within them.

So the first time you reference a collection it will record which embedding model was used (including if it was the default, whatever that default was currently set to). Then anything else written to that collection will automatically use the same model.

Schema:

collections: id (int), name (str), model (str)
embeddings: collection_id (int), id (str), embedding (blob), meta (str)

The meta column there is JSON that can be used to store arbitrary additional key/value pairs against an embedding blob. I'm not sure if I'll use that for anything yet, but it feels useful.

I was going to call that collections table embedding_collections so that it couldn't be confused with a future prompt_collections feature I might decide to add, but since embeddings go in a separate database from prompts and responses I decided not to do that.
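
A minimal sketch of that schema using sqlite-utils (column types and the compound primary key here are assumptions, not settled decisions):

import sqlite_utils

db = sqlite_utils.Database("embeddings.db")

db["collections"].create({"id": int, "name": str, "model": str}, pk="id")
db["embeddings"].create(
    {"collection_id": int, "id": str, "embedding": bytes, "meta": str},
    pk=("collection_id", "id"),
    foreign_keys=[("collection_id", "collections", "id")],
)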

simonw commented 11 months ago

Storing these things isn't particularly interesting if you can't then do something with them.

I'm going to add one more command: llm similar - which returns the N most similar results to the thing you pass in.

This will return the 10 most similar posts to the post with ID 1:

llm similar posts 1

Use -n 20 to change the number of results.

Or you can embed text and use it for the lookup straight away:

llm similar posts -c 'this is content'

No need to ever specify the model here because that can be looked up for the collection.
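
Resolving the model is a single lookup against the collections table from the schema sketched above - something like this, using the get_embedding_model() function proposed earlier:

import sqlite_utils
import llm

def model_for_collection(db: sqlite_utils.Database, name: str):
    # The model was recorded when the collection was first used,
    # so there is never a need for an explicit -m option here.
    row = next(db["collections"].rows_where("name = ?", [name]))
    return llm.get_embedding_model(row["model"])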

The big question: how do alternative vector search plugins come into play here?

The default is going to be a Python brute-force algorithm - but I very much want to have plugins able to add support for sqlite-vss or FAISS or Pinecone or similar.

Perhaps an index can be configured against a collection, then any changes to that collection automatically trigger an update to that index.

llm configure-vector-index posts faiss /tmp/faiss-index

Alternative names:

llm vector posts faiss …
llm configure-vector posts faiss …
llm index posts faiss …
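
One very rough sketch of what that plugin hook could look like - every name below is hypothetical and nothing here is settled:

@hookspec
def register_vector_indexes(register):
    "Register vector index implementations provided by plugins"

class HypotheticalFaissIndex:
    # An index would keep itself in sync with a collection:
    # upsert/remove on writes, search for llm similar lookups.
    def upsert(self, id, embedding): ...
    def remove(self, id): ...
    def search(self, embedding, n=10): ...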

simonw commented 11 months ago

What database should these be stored in?

The logs.db database feels wrong to me because of the name. I could rename it I suppose.

I'm inclined to have an embed.db database in the io.datasette.llm directory by default, but let people pass -d data.db to the various embedding commands (or set a LLM_EMBED_DB=path environment variable) if they want to use something else.

Another important question: what should the Python API version of those llm embed and llm similar commands look like?

Maybe that becomes an EmbeddingCollection class which gets passed a sqlite_utils object to its constructor and then provides .embed() and .similar() methods.
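
A rough shape for that class (a sketch only - the method signatures follow the schema proposed above and are not a final API):

import struct
import sqlite_utils
import llm

class EmbeddingCollection:
    def __init__(self, db: sqlite_utils.Database, name: str):
        self.db = db
        self.name = name
        # The collection row records which embedding model to use
        self.row = next(db["collections"].rows_where("name = ?", [name]))

    def embed(self, id, text):
        # Embed with the collection's recorded model, store the packed vector
        model = llm.get_embedding_model(self.row["model"])
        vector = model.embed(text)
        self.db["embeddings"].insert(
            {
                "collection_id": self.row["id"],
                "id": id,
                "embedding": struct.pack("f" * len(vector), *vector),
                "meta": None,
            },
            replace=True,
        )

    def similar(self, text, n=10):
        # Embed the query text, then brute-force rank every stored
        # embedding in this collection by cosine similarity
        ...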

simonw commented 11 months ago

Ran a comparison of the formats using ChatGPT Code Interpreter. For 100 floats:

https://chat.openai.com/share/3db82122-756c-4184-b26b-09d0ca1fe0af

I'm going to do all four options: --json, --blob, --hex and --base64.

simonw commented 11 months ago

Actually I'll do -f json/blob/hex/base64 instead so I don't need four options.
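
The four formats then just become different serializations of the same packed bytes - a sketch (format_embedding is a hypothetical helper, not part of the CLI):

import base64, json, struct

def format_embedding(vector, fmt="json"):
    blob = struct.pack("f" * len(vector), *vector)
    if fmt == "json":
        return json.dumps(vector)
    if fmt == "blob":
        return blob
    if fmt == "hex":
        return blob.hex()
    if fmt == "base64":
        return base64.b64encode(blob).decode("ascii")
    raise ValueError(f"Unknown format: {fmt}")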

simonw commented 11 months ago

A thought: it would be convenient to have the option to store the text itself directly in that embeddings table. This would make it much easier to implement things like basic RAG without needing to join against text stored elsewhere.

Maybe a -t option? Or -s/--store for store?

No, because -s means --system elsewhere and there may be embedding models that have a concept similar to a system prompt.

I'll go with --store as a long-form option with no short version; -t means --template or --truncate elsewhere as well.

simonw commented 11 months ago

Reviewing the existing --help, I have some design decisions to make:

Commands:
  prompt*    Execute a prompt
  aliases    Manage model aliases
  embed      Embed text and store or return the result
  install    Install packages from PyPI into the same environment as LLM
  keys       Manage stored API keys for different models
  logs       Tools for exploring logged prompts and responses
  models     Manage available models
  openai     Commands for working directly with the OpenAI API
  plugins    List installed plugins
  templates  Manage stored prompt templates
  uninstall  Uninstall Python packages from the LLM environment

What should the embeddings equivalent of llm models be?

Some options:

I don't like any of these much but I'm leaning towards embed-models at the moment.

simonw commented 11 months ago

The thing I like about embed-models is that it feels at least a little bit consistent with the existing embed and models commands.

simonw commented 11 months ago

Yeah, embed-models looks good in this list:

Commands:
  prompt*       Execute a prompt
  aliases       Manage model aliases
  embed         Embed text and store or return the result
  embed-models  Manage available embedding models

Next problem though: aliases.

I want embedding models to be able to have aliases too.

But... the existing aliases command only works for regular LLM models.

And it assumes a single global namespace for the model IDs themselves. What if an embedding model wants to have the same model ID as an LLM model? How would the llm aliases set my-alias model-id command know which of the two should be referenced by that alias?

simonw commented 11 months ago

One option is that setting an alias for a model_id that is available in both contexts sets the alias in both contexts too.

simonw commented 11 months ago

This kind of makes sense, because at some point I expect you'll be able to e.g. llm install llm-llama2-7b and get a plugin that offers both an LLM and an embedding model - so having the same model ID for both (and setting an alias which corresponds to both) would make sense.

simonw commented 11 months ago

And aliases are currently stored like this:

cat ~/Library/Application\ Support/io.datasette.llm/aliases.json
{
    "w": "mlc-chat-WizardLM-13B-V1.2-q4f16_1",
    "l2": "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "llama-2-13b": "mlc-chat-Llama-2-13b-chat-hf-q4f16_1",
    "l13b": "mlc-chat-Llama-2-13b-chat-hf-q4f16_1",
    "l7b": "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "Llama-2-7b-chat": "mlc-chat-Llama-2-7b-chat-hf-q4f16_1",
    "l2u": "mlc-chat-georgesung-llama2-7b-chat-uncensored-q4f16_1",
    "claude": "openrouter/anthropic/claude-2",
    "turbo": "gpt-3.5-turbo-16k",
    "llama70b": "meta-llama/Llama-2-70b-chat-hf",
    "codellama-python": "codellama-13b-python.ggmlv3.Q4_K_S"
}

simonw commented 11 months ago

I'm going to implement the llm embed-models default command as well, for seeing the default embedding model and setting it.

simonw commented 11 months ago

Adding two more commands:

llm embed-db path

Outputs the path to the embeddings.db database. Means you can do things like this:

datasette "$(llm embed-db path)"

And:

llm embed-db collections

Outputs a list of collections in the database.

simonw commented 11 months ago

llm embed-db collections --json
[
    {
        "name": "examples",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "num_embeddings": 4
    },
    {
        "name": "phrases",
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "num_embeddings": 1
    }
]

And:

llm embed-db collections
examples: sentence-transformers/all-MiniLM-L6-v2
  4 embeddings
phrases: sentence-transformers/all-MiniLM-L6-v2
  1 embedding
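
Both listings could come from roughly the same query - a sketch against the proposed schema (not necessarily how it ends up implemented):

import sqlite_utils

db = sqlite_utils.Database("embeddings.db")
rows = db.query(
    """
    select collections.name, collections.model,
           count(embeddings.id) as num_embeddings
    from collections
    left join embeddings on embeddings.collection_id = collections.id
    group by collections.id
    """
)
for row in rows:
    print("{}: {}".format(row["name"], row["model"]))
    print("  {} embeddings".format(row["num_embeddings"]))  # singular/plural handling omitted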

simonw commented 11 months ago

Landed that on main. Documentation so far is here:

I'm going to close this and do the rest of the work in separate tickets.