I think there are two parts to this: embedding a string, and managing collections.
For embedding strings, the existing get_embedding_model(...) API is most of the way there:
model = llm.get_embedding_model("ada-002")
floats = model.embed("text goes here")
I think the decode and encode functions for turning them into binary could go in the llm namespace directly.
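A minimal sketch of what those two helpers could look like, assuming the vectors are packed as little-endian 32-bit floats (that storage format is an assumption here, not something settled):

import struct

def encode(values):
    # Pack a sequence of floats into a compact binary blob
    return struct.pack("<" + "f" * len(values), *values)

def decode(binary):
    # Unpack the blob back into a tuple of floats
    return struct.unpack("<" + "f" * (len(binary) // 4), binary)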
Collections are a bit harder. A collection should be able to store embeddings and run similarity, see #190 - eventually also manage indexes.
Sketching an initial idea:
collection = llm.Collection(db, "name-of-collection")
# If the collection does not exist it would be created with the default embedding model
if collection.exists():
    # Already exists in the DB
    print("Contains {} items".format(collection.count()))
# Or specify the model specifically:
model = llm.get_embedding_model("ada-002")
collection = llm.Collection(db, "posts", model)
# Or pass the model ID using a named parameter:
collection = llm.Collection(db, "posts", model_id="ada-002")
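The db argument here would presumably be a sqlite-utils Database; a minimal setup sketch (the filename is just an example):

import sqlite_utils
import llm

db = sqlite_utils.Database("embeddings.db")
collection = llm.Collection(db, "posts", model_id="ada-002")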
Once you've got the collection:
collection.embed("id", "text to embed goes here")
# Add store=True to store the text in the content column
# With metadata:
collection.embed("id", "text to embed goes here", {"metadata": "here"})
# Or for multiple things at once:
collection.embed_multi({
    "id1": "text for id1",
    "id2": "text for id2"
})
# Add store=True to store the text in the content column
But what if you want to store metadata as well? Not 100% sure about that, maybe:
collection.embed_multi({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": "text for id2"
})
Not crazy about an API design where a dictionary's values can be either strings or tuples, though.
Maybe this:
collection.embed_multi_with_metadata({
    "id1": ("text for id1", {"metadata": "goes here"}),
    "id2": ("text for id2", {"more": "metadata"}),
})
And for retrieval:
ids_and_scores = collection.similar_by_id("id", number=5)
Or:
ids_and_scores = collection.similar("text to be embedded", number=5)
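Neither is implemented yet. A rough sketch of what a brute-force similar() could do - the embeddings table name, its collection/id/embedding columns, the Collection attributes used, and the decode() helper from above are all assumptions at this point:

def cosine_similarity(a, b):
    # Plain-Python cosine similarity, no numpy required
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = sum(x * x for x in a) ** 0.5
    magnitude_b = sum(x * x for x in b) ** 0.5
    return dot / (magnitude_a * magnitude_b)

def similar(collection, text, number=5):
    # Embed the query text, then score it against every stored embedding
    query = collection.model.embed(text)
    scores = [
        (row["id"], cosine_similarity(query, decode(row["embedding"])))
        for row in collection.db["embeddings"].rows_where(
            "collection = ?", [collection.name]
        )
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:number]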
For embedding models that take options (not a thing yet) I think I'll add options=dict parameters to some of these methods, as opposed to using **kwargs which could clash with other keyword arguments like store=True.
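So a call might look something like this (the option name is purely hypothetical):

collection.embed("id", "text to embed goes here", store=True, options={"some_option": "value"})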
Need to implement the similar methods next.
mypy errors:
llm/embeddings.py:46: error: Item "View" of "Table | View" has no attribute "insert" [union-attr]
llm/embeddings.py:50: error: Item "None" of "EmbeddingModel | None" has no attribute "model_id" [union-attr]
llm/embeddings.py:55: error: Incompatible return value type (got "Any | None", expected "int") [return-value]
llm/embeddings.py:105: error: Item "None" of "EmbeddingModel | None" has no attribute "embed" [union-attr]
llm/embeddings.py:106: error: Item "View" of "Table | View" has no attribute "insert" [union-attr]
llm/default_plugins/openai_models.py:71: error: Return type "list[list[float]]" of "embed_batch" incompatible with return type "Iterator[list[float]]" in supertype "EmbeddingModel" [override]
llm/default_plugins/openai_models.py:71: error: Argument 1 of "embed_batch" is incompatible with supertype "EmbeddingModel"; supertype defines the argument type as "Iterable[str]" [override]
llm/default_plugins/openai_models.py:71: note: This violates the Liskov substitution principle
llm/default_plugins/openai_models.py:71: note: See https://mypy.readthedocs.io/en/stable/common_issues.html#incompatible-overrides
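Most of these are Optional/union narrowing problems. A small sketch of the kind of fix mypy is asking for, using sqlite-utils directly (the table name and values are illustrative, not the actual code):

from typing import cast

import sqlite_utils
from sqlite_utils.db import Table

db = sqlite_utils.Database(memory=True)

# db["embeddings"] is typed as "Table | View"; cast it so .insert() type-checks
table = cast(Table, db["embeddings"])
table.insert({"id": "id1", "embedding": b"\x00" * 8})

# The "EmbeddingModel | None" errors need the same treatment:
# assert the model is not None (or raise) before touching .embed() or .model_id.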
Also need to refactor the embed CLI command to use llm.Collection.
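A very rough sketch of the shape that refactor could take - the argument names, options and database path handling here are all assumptions, not the real command:

import click
import sqlite_utils
import llm

@click.command()
@click.argument("collection_name")
@click.argument("id")
@click.argument("content")
@click.option("-m", "--model", "model_id", default=None, help="Embedding model to use")
@click.option("--store", is_flag=True, help="Store the text in the content column")
def embed(collection_name, id, content, model_id, store):
    # Delegate all of the embedding and storage logic to llm.Collection
    db = sqlite_utils.Database("embeddings.db")
    collection = llm.Collection(db, collection_name, model_id=model_id)
    collection.embed(id, content, store=store)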
I haven't implemented these methods yet: https://github.com/simonw/llm/blob/212cd617f35fbc4c918e7681d6cc89b97da776f9/llm/embeddings.py#L128-L148
I also haven't tested and documented the store=True and metadata=... mechanisms. Plus there's no way to get BACK the metadata/stored content yet.
These were both addressed in:
OK, this is ready now: https://llm.datasette.io/en/latest/embeddings/python-api.html
Split from: #185