Open paluigi opened 1 month ago
Hi @paluigi
Methods like add
and query
are just convenience methods, which provide some default configurations.
The only workaround to use them with another metric is to create a collection with create_collection
call (before calling either of add
or query
).
However, take into account, that add
and query
methods have specific rules for the vector names, they should be the same as the ones returned by get_vector_field_name()
HI @joein , thanks for the reply. If I want to use FastEmbed then what would be the correct way to use the Distance.DOT metric in a collection?
For what I can see, in my example above the collection would be initialized with the Distance.COSINE metric, even if I tried to set another metric.
The set_model method does not have a distance parameter.
def set_model(
self,
embedding_model_name: str,
max_length: Optional[int] = None,
cache_dir: Optional[str] = None,
threads: Optional[int] = None,
providers: Optional[Sequence["OnnxProvider"]] = None,
**kwargs: Any,
):
# Method body
You can specify a different metric while creating a collection. But as far as I know you can not specify a default metric for all operations on the client.
While creating the collection the vector name and the size of the vector field has to match the model specs. One way to do this would be to create a helper method using pieces from the FastEmbedMixin.
import uuid
from qdrant_client import QdrantClient
from qdrant_client import models
from fastembed import TextEmbedding
def add_points(
collection_name, documents, ids=None, model_name=None, distance=models.Distance.DOT
):
# We could also pass in the client as a param
client = QdrantClient(path="./db/")
if model_name is not None:
client.set_model(model_name)
# Get the vector field name and the vector size for the chosen model.
# Using the exact name is important because client.query() looks at the
# {vector_field_name} vector.
vector_field_name = client.get_vector_field_name()
vector_params = client.get_fastembed_vector_params()
# Create the collection if it does not exist.
if not client.collection_exists(collection_name):
client.create_collection(
collection_name=collection_name,
vectors_config={
vector_field_name: models.VectorParams(
size=vector_params[vector_field_name].size,
distance=distance,
)
},
)
# Load the embedding model from FastEmbed.
if model_name is not None:
embedding_model = TextEmbedding(model_name)
else:
embedding_model = TextEmbedding()
# Create a generator for UUIDs, if ids are not passed.
if ids is None:
ids = iter(lambda: uuid.uuid4().hex, None)
elif type(ids) is list:
ids = iter(ids)
# Upload points
client.upload_points(
collection_name=collection_name,
points=[
models.PointStruct(
id=next(ids),
vector={
vector_field_name: embedding,
},
)
# Embed the documents using the embedding_model
for embedding in embedding_model.embed(documents)
],
)
documents = [
"This is built to be faster and lighter than other embedding libraries e.g. Transformers, Sentence-Transformers, etc.",
"fastembed is supported by and maintained by Qdrant.",
]
model_name = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
add_points(
collection_name="test_collection",
documents=documents,
model_name=model_name,
)
Because we have followed the naming conventions we can use the query method out of the box.
client.set_model(model_name)
search_result = client.query(
collection_name="test_collection",
query_text=query_text,
)
This is a very minimal implementation and there might be better ways to do this. But it could be modified to suit your purpose.
Issue
Not able to change distance model when creating a collection with FastEmbed.
Minimal steps to reproduce
Result
{'fast-paraphrase-multilingual-mpnet-base-v2': VectorParams(size=768, distance=<Distance.COSINE: 'Cosine'>, hnsw_config=None, quantization_config=None, on_disk=None, datatype=None, multivector_config=None)}
Expected result
Distance should be Distance.DOT
Environment
OS: Ubuntu 22.04.1 qdrant-client==1.10.1 fastembed==0.3.4
More details
When creating a collection with Sentence Transformers I am able to set a different distance model (DOT, EUCLID). With FastEmbed it seems the distance is only cosine. Also looking to the code, it seems all models are initialized with cosine distance only.