Support for embedding modes

simonw / llm

Access large language models from the command-line

https://llm.datasette.io

Apache License 2.0

Support for embedding modes #457

Open simonw opened 5 months ago

simonw commented 5 months ago

Several embedding models supported by LLM plugins have a concept of "modes" - usually called something like "task types" or "input types".

Some examples:

Gemini: https://github.com/google-gemini/cookbook/blob/9eb52260f979aa8339e6d1eb77323faa178bbd78/quickstarts/rest/Embeddings_REST.ipynb - "Use task_type to provide a hint to the model how you'll use the embeddings"
Nomic: https://github.com/simonw/llm-nomic-api-embed/issues/2
E5-large-v2: https://til.simonwillison.net/llms/embed-paragraphs - this one works using passage: and query: prefixes.

We need a mechanism to support these in LLM core itself, mainly for the llm similar command: the stored embeddings need to be calculated as RETRIEVAL_DOCUMENT (in Gemini's terminology), but the search query should be embedded as RETRIEVAL_QUERY.
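To make the document/query split concrete, here is a minimal sketch of the prefix-based approach that E5-large-v2 uses. The helper name apply_e5_prefix and the mode names "document" and "query" are assumptions made for illustration; only the passage: and query: prefixes come from the linked TIL.

```python
# Hypothetical helper, not part of LLM: E5-style models tell stored
# documents apart from search queries purely through a text prefix.
E5_PREFIXES = {
    "document": "passage: ",  # text that will be stored and searched over
    "query": "query: ",       # text used to search the stored embeddings
}


def apply_e5_prefix(text: str, mode: str) -> str:
    """Return text with the prefix for the given (assumed) mode name."""
    if mode not in E5_PREFIXES:
        raise ValueError(f"Unknown mode: {mode!r}")
    return E5_PREFIXES[mode] + text


# The stored document and the search query get different prefixes
# before either one is handed to the embedding model.
stored = apply_e5_prefix("Goats are hardy animals.", "document")
search = apply_e5_prefix("reasons to get a goat", "query")
```

API-based models such as Gemini or Nomic would pass the mode as a request parameter instead of a prefix, but the core problem is the same: the stored side and the query side need different settings.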

simonw commented 5 months ago

I asked about terminology on Twitter a couple of weeks ago: https://twitter.com/simonw/status/1774278907380019637

Voyage AI (https://docs.voyageai.com/docs/embeddings) calls them input_type, with values query and document
Cohere (https://txt.cohere.com/introducing-embed-v3/) also uses input_type, with values search_document, search_query, classification and clustering

simonw commented 5 months ago

I'm thinking:

llm embed -c 'hello world' -m nomic-1.5 --mode clustering

The llm embed and llm embed-multi commands will default to the mode that is designed for stored documents.

llm similar will default to the mode that's intended for retrieval queries.

All three commands will accept a --mode option to switch to something other than the default for that command.

Modes will be validated against the list of known modes for the embedding model.
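The default-plus-validation rule could be sketched roughly as follows. The function name resolve_mode and its error messages are hypothetical; the list of modes is the Nomic one used elsewhere in this issue.

```python
from typing import Optional, Sequence


def resolve_mode(
    requested: Optional[str],
    modes: Sequence[str],
    default: Optional[str],
) -> Optional[str]:
    """Pick the mode a command should use (hypothetical helper).

    requested is the value of --mode (None if the option was omitted),
    modes is the model's list of known modes, and default is the
    per-command default (document mode for embed, query mode for similar).
    """
    if not modes:
        # Model has no concept of modes: passing --mode is an error.
        if requested is not None:
            raise ValueError("This embedding model does not support modes")
        return None
    mode = requested if requested is not None else default
    if mode not in modes:
        raise ValueError(
            f"Unknown mode {mode!r}, expected one of: {', '.join(modes)}"
        )
    return mode


NOMIC_MODES = ["search_document", "search_query", "clustering", "classification"]

# llm embed with no --mode falls back to the document default
embed_mode = resolve_mode(None, NOMIC_MODES, "search_document")
# llm similar with no --mode falls back to the query default
similar_mode = resolve_mode(None, NOMIC_MODES, "search_query")
# An explicit --mode clustering overrides either default
explicit_mode = resolve_mode("clustering", NOMIC_MODES, "search_document")
```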

So maybe the code looks like this:

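from llm import EmbeddingModel

class NomicAIEmbeddingModel(EmbeddingModel):
    needs_key = "nomic"
    key_env_var = "NOMIC_API_KEY"
    batch_size = 100
    # Proposed attributes: every mode this model understands
    modes = ["search_document", "search_query", "clustering", "classification"]
    # Proposed default for llm embed and llm embed-multi
    default_document_mode = "search_document"
    # Proposed default for llm similar
    default_query_mode = "search_query"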

The selected mode is then passed as an argument to the embed_batch() method - but only for models that define modes.
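One way the "only for models that define modes" dispatch might work is sketched below. SketchEmbeddingModel is a stand-in written for this example, not LLM's real EmbeddingModel, and the fake vectors exist only to make the behaviour visible.

```python
from typing import Iterable, List, Optional


class SketchEmbeddingModel:
    """Stand-in base class for illustration; not llm.EmbeddingModel."""

    modes: List[str] = []
    default_document_mode: Optional[str] = None

    def embed_batch(self, items, mode=None):
        raise NotImplementedError

    def embed_multi(self, items: Iterable[str], mode: Optional[str] = None):
        items = list(items)
        if self.modes:
            # Model defines modes: resolve, validate and pass the mode along.
            chosen = mode or self.default_document_mode
            if chosen not in self.modes:
                raise ValueError(f"Unknown mode: {chosen!r}")
            return list(self.embed_batch(items, mode=chosen))
        if mode is not None:
            raise ValueError("This model does not support modes")
        # Model has no modes: embed_batch() keeps its existing signature.
        return list(self.embed_batch(items))


class PlainModel(SketchEmbeddingModel):
    def embed_batch(self, items):
        # Fake one-dimensional "embedding": the text length
        return [[float(len(text))] for text in items]


class ModalModel(SketchEmbeddingModel):
    modes = ["search_document", "search_query"]
    default_document_mode = "search_document"

    def embed_batch(self, items, mode):
        # Fake embedding that also encodes which mode was used
        return [[float(len(text)), float(self.modes.index(mode))] for text in items]


plain_vectors = PlainModel().embed_multi(["hi"])
document_vectors = ModalModel().embed_multi(["hi"])
query_vectors = ModalModel().embed_multi(["hi"], mode="search_query")
```

Keeping the extra argument off models without modes means existing plugins would not need to change their embed_batch() signature.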

simonw commented 5 months ago

I'm tempted to have modes defined as an enum of some sort, so that the Python API for embeddings could look something like this:

vector = nomic.embed("reasons to get a goat", mode=nomic.Modes.search_query)

And maybe the class then looks like this:

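from enum import Enum

from llm import EmbeddingModel

class NomicAIEmbeddingModel(EmbeddingModel):
    ...
    # Nested enum listing every mode this model understands
    class Modes(Enum):
        search_document = "search_document"
        search_query = "search_query"
        clustering = "clustering"
        classification = "classification"
    # Default for llm similar
    default_mode_search = Modes.search_query
    # Default for llm embed and llm embed-multi
    default_mode_document = Modes.search_document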

I considered using StrEnum but it was only added in Python 3.11.
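A possible workaround on older Python versions, offered here as a sketch and not as the chosen design, is to mix str into a regular Enum. Members then compare equal to plain strings and serialize as strings, which covers much of what StrEnum provides.

```python
import json
from enum import Enum


class Modes(str, Enum):
    # Mixing in str makes each member an actual str instance
    search_document = "search_document"
    search_query = "search_query"
    clustering = "clustering"
    classification = "classification"


# Members compare equal to the raw strings an API or CLI would supply
matches_string = Modes.search_query == "search_query"

# Looking up by value validates user input: unknown values raise ValueError
parsed = Modes("clustering")

# Because members are str instances, json.dumps emits the plain value
payload = json.dumps({"task_type": Modes.search_query})
```

Looking a mode up by value, as in Modes("clustering"), doubles as validation for strings coming from a --mode option.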