simonw / llm

Access large language models from the command-line
https://llm.datasette.io
Apache License 2.0
4.4k stars 241 forks

Python API for embeddings #191

Closed
simonw closed this 1 year ago

simonw commented 1 year ago

Split from:

simonw commented 1 year ago

I think there are two parts to this: embedding a string, and managing collections.

For embedding strings the existing get_embedding_model(...) API is most of the way there:

model = llm.get_embedding_model("ada-002")
floats = model.embed("text goes here")

I think the decode() and encode() functions for turning embeddings into binary (and back) could go in the llm namespace directly.
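
Something like this would probably work (a rough sketch, assuming the vectors get packed as little-endian 32 bit floats):

import struct

def encode(values):
    # Pack a list of floats into a compact binary blob
    return struct.pack("<" + "f" * len(values), *values)

def decode(binary):
    # Unpack the blob back into a tuple of floats
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)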

Collections are a bit harder. A collection should be able to store embeddings and run similarity searches (see #190) - and eventually manage indexes as well.

simonw commented 1 year ago

Sketching an initial idea:

collection = llm.Collection(db, "name-of-collection")
# If the collection does not exist it would be created with the default embedding model

if collection.exists():
    # Already exists in the DB
    print("Contains {} items".format(collection.count())

# Or specify the model specifically:
model = llm.get_embedding_model("ada-002")
collection = llm.Collection(db, "posts", model)

# Or pass the model ID using a named parameter:
collection = llm.Collection(db, "posts", model_id="ada-002")

Once you've got the collection:

collection.embed("id", "text to embed goes here")
# Add store=True to store the text in the content column

# With metadata:
collection.embed("id", "text to embed goes here", {"metadata": "here"})

# Or for multiple things at once:
collection.embed_multi({
    "id1": "text for id1",
    "id2": "text for id2"
})
# Add store=True to store the text in the content column

But what if you want to store metadata as well? Not 100% sure about that, maybe:

collection.embed_multi({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": "text for id2"
})

Not crazy about an API design that accepts a dictionary with either strings or tuples as values though.

Maybe this:

collection.embed_multi_with_metadata({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": ("text for id2", {"more": "metadata"}),
})
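
That version can just loop and delegate to embed() - rough sketch, assuming embed() keeps the (id, text, metadata) signature from above:

def embed_multi_with_metadata(self, entries, store=False):
    # entries maps each ID to a (text, metadata) tuple
    for entry_id, (text, metadata) in entries.items():
        self.embed(entry_id, text, metadata, store=store)
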
simonw commented 1 year ago

And for retrieval:

ids_and_scores = collection.similar_by_id("id", number=5)

Or:

ids_and_scores = collection.similar("text to be embedded", number=5)
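
Scoring can start out as brute-force cosine similarity over every stored vector - a sketch with a hypothetical helper, not the final implementation:

import math

def cosine_similarity(a, b):
    # 1.0 means the two embedding vectors point in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    magnitude = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / magnitude

similar() would embed the query text, score it against every vector in the collection and return the top number results; similar_by_id() would skip the embedding step and look the vector up by its ID first.
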
simonw commented 1 year ago

For embedding models that take options (not a thing yet) I think I'll add an options= dict parameter to some of these methods, as opposed to using **kwargs, which could clash with other keyword arguments like store=True.
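
So the signature would end up looking something like this (sketch only, the option name is made up):

def embed(self, entry_id, text, metadata=None, store=False, options=None):
    # options is a plain dict of model options, e.g. {"truncate": True} -
    # keeping it as a single argument means it can never collide with
    # store=True or metadata=...
    options = options or {}
    ...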

simonw commented 1 year ago

Need to implement the similar methods next:

https://github.com/simonw/llm/blob/6f761702dc7e85e7d24a38440309bac45c246d35/llm/embeddings.py#L138-L162

simonw commented 1 year ago

mypy errors:

llm/embeddings.py:46: error: Item "View" of "Table | View" has no attribute "insert"  [union-attr]
llm/embeddings.py:50: error: Item "None" of "EmbeddingModel | None" has no attribute "model_id"  [union-attr]
llm/embeddings.py:55: error: Incompatible return value type (got "Any | None", expected "int")  [return-value]
llm/embeddings.py:105: error: Item "None" of "EmbeddingModel | None" has no attribute "embed"  [union-attr]
llm/embeddings.py:106: error: Item "View" of "Table | View" has no attribute "insert"  [union-attr]
llm/default_plugins/openai_models.py:71: error: Return type "list[list[float]]" of "embed_batch" incompatible with return type "Iterator[list[float]]" in supertype "EmbeddingModel"  [override]
llm/default_plugins/openai_models.py:71: error: Argument 1 of "embed_batch" is incompatible with supertype "EmbeddingModel"; supertype defines the argument type as "Iterable[str]"  [override]
llm/default_plugins/openai_models.py:71: note: This violates the Liskov substitution principle
llm/default_plugins/openai_models.py:71: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides
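
Most of those are standard Optional / union narrowing problems - one way to quiet them (a sketch, not necessarily the fix I'll ship) is to cast or assert before the attribute access:

from typing import cast

import sqlite_utils
from sqlite_utils.db import Table

db = sqlite_utils.Database(memory=True)
# db["..."] is typed as Table | View, so narrow it before calling .insert()
table = cast(Table, db["embeddings"])
table.insert({"id": "example"})

# For the EmbeddingModel | None errors an assert narrows the type:
#     assert self.model is not None
#     self.model.embed(text)

The two override errors are different: the embed_batch() signature on the OpenAI model needs to match the supertype, accepting Iterable[str] and returning Iterator[list[float]].
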
simonw commented 1 year ago

Also need to refactor the embed CLI command to use llm.Collection.

simonw commented 1 year ago

I haven't implemented these methods yet: https://github.com/simonw/llm/blob/212cd617f35fbc4c918e7681d6cc89b97da776f9/llm/embeddings.py#L128-L148

simonw commented 1 year ago

I also haven't tested or documented the store=True and metadata=... mechanisms.

Plus there's no way to get BACK the metadata/stored content yet.

simonw commented 1 year ago

> I also haven't tested or documented the store=True and metadata=... mechanisms.
>
> Plus there's no way to get BACK the metadata/stored content yet.

These were both addressed in:

simonw commented 1 year ago

OK, this is ready now: https://llm.datasette.io/en/latest/embeddings/python-api.html