Closed hopkins385 closed 1 year ago
Hi @hopkins385
Yes, you are right, it is a typo in docs, thanks for pointing it out! We will update it soon
If you run into any other problems with filters, feel free to reach out to us or take a look at the tests, e.g. the nested filter example.
Also, it should be filter=models.Filter(must=[models.FieldCondition...., and not just filter=(must=...).
Cool, happy I was able to help. Maybe someone can briefly explain which approach would be more suitable for my use case. Scenario: A collection contains multiple "documents" (langchain), grouped by the custom metadata field "media_id". Problem: I want to restrict the search to a subset of the documents, say 2 out of 5, by passing an array of media_ids.
Which approach would be better? Sidenote: For me it's not clear whether the database first looks for "similar" vectors and then filters the result, or first applies the filter and then locates "similar" vectors.
Approach 1:
def get_qdrant_documents(query: str, collection_name: str, media_ids: List[str] | None) -> List[Document]:
    # ...
    filtr = models.Filter(
        must=[]
    )
    if media_ids is not None:
        filtr.must.append(
            models.FieldCondition(
                key="metadata.media_id",
                match=models.MatchAny(any=media_ids)
            )
        )
    return qdrant.similarity_search(
        query=query,
        filter=filtr,
    )
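For reference, the filter built in Approach 1 corresponds roughly to the JSON filter form used by Qdrant's REST API. The sketch below builds that structure as a plain dict, without the Python client; `build_media_filter` is a hypothetical helper name introduced only for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def build_media_filter(media_ids: Optional[List[str]]) -> Dict[str, Any]:
    """Hypothetical helper: build a REST-style filter dict for the given media_ids."""
    conditions: List[Dict[str, Any]] = []
    if media_ids is not None:
        # A single "match any" condition: the payload value must be one of media_ids.
        conditions.append({"key": "metadata.media_id", "match": {"any": media_ids}})
    return {"must": conditions}


print(json.dumps(build_media_filter(["id-1", "id-2"]), indent=2))
```

With `media_ids=None` the helper returns an empty `must` list, mirroring the unfiltered case in Approach 1.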
Approach 2:
def get_qdrant_documents(query: str, collection_name: str, media_ids: List[str] | None) -> List[Document]:
    filtr = models.Filter(
        must=[
            models.NestedCondition(
                nested=models.Nested(
                    key="metadata",
                    filter=models.Filter(
                        must=[]
                    )
                )
            )
        ]
    )
    if media_ids is not None:
        for media_id in media_ids:
            filtr.must[0].nested.filter.must.append(
                models.FieldCondition(
                    key="media_id",
                    match=models.MatchValue(value=media_id)
                )
            )
    return qdrant.similarity_search(
        query=query,
        filter=filtr,
    )
Thank you for spotting this, we are updating the docs.
Hello, @hopkins385
I guess the second filter is not what you want, since should has to be used instead of must.
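To illustrate why: Approach 2 adds one MatchValue condition per media_id under must, so a document would need its media_id to equal every listed value at once. Below is a toy, pure-Python model of the AND/OR semantics. It does not call Qdrant, and `match_value`, `matches_must` and `matches_should` are hypothetical helpers for illustration only.

```python
from typing import Callable, Dict, List

Payload = Dict[str, str]
Condition = Callable[[Payload], bool]


def match_value(key: str, value: str) -> Condition:
    """Condition that holds when payload[key] equals value."""
    return lambda payload: payload.get(key) == value


def matches_must(payload: Payload, conditions: List[Condition]) -> bool:
    # "must" behaves like AND: every condition has to hold.
    return all(cond(payload) for cond in conditions)


def matches_should(payload: Payload, conditions: List[Condition]) -> bool:
    # "should" behaves like OR: at least one condition has to hold.
    return any(cond(payload) for cond in conditions)


docs = [{"media_id": "a"}, {"media_id": "b"}, {"media_id": "c"}]
wanted = [match_value("media_id", "a"), match_value("media_id", "b")]

must_hits = [d["media_id"] for d in docs if matches_must(d, wanted)]
should_hits = [d["media_id"] for d in docs if matches_should(d, wanted)]

print(must_hits)    # []
print(should_hits)  # ['a', 'b']
```

No single document has a media_id equal to both "a" and "b", so the must variant matches nothing, while the should variant behaves like the MatchAny condition from Approach 1.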
I think in your case the two approaches are not really different.
An explanation of the difference between filters like metadata.media_id and those built with models.Nested can be found in the docs.
About the sidenote: It is a bit more complex than just pre-filtering or post-filtering. Qdrant has an advanced query planner which helps to perform queries efficiently. We applied certain modifications to the HNSW algorithm itself as well. This blog post can help you better understand Qdrant internals.
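For intuition only, here is a toy brute-force comparison of the two naive strategies the question mentions. This is not how Qdrant executes queries (as said above, it relies on a query planner and a modified HNSW). The data and the function names `post_filter` and `pre_filter` are made up for illustration.

```python
from typing import List, Set, Tuple

# (doc id, media_id, distance to the query); smaller distance = more similar.
Point = Tuple[str, str, float]

points: List[Point] = [
    ("d1", "m1", 0.10),
    ("d2", "m1", 0.20),
    ("d3", "m2", 0.30),
    ("d4", "m3", 0.40),
    ("d5", "m3", 0.50),
]


def post_filter(points: List[Point], allowed: Set[str], k: int) -> List[str]:
    # Take the k nearest points first, then drop those that fail the filter.
    nearest = sorted(points, key=lambda p: p[2])[:k]
    return [p[0] for p in nearest if p[1] in allowed]


def pre_filter(points: List[Point], allowed: Set[str], k: int) -> List[str]:
    # Apply the filter first, then take the k nearest of the remaining points.
    candidates = [p for p in points if p[1] in allowed]
    return [p[0] for p in sorted(candidates, key=lambda p: p[2])[:k]]


allowed = {"m2", "m3"}
post_hits = post_filter(points, allowed, k=3)
pre_hits = pre_filter(points, allowed, k=3)

print(post_hits)  # ['d3']
print(pre_hits)   # ['d3', 'd4', 'd5']
```

Plain post-filtering can return fewer than k results when the nearest neighbours mostly fail the filter, while plain pre-filtering requires scanning every matching point. That trade-off is why a real engine does not simply pick one of the two.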
As per the docs, the nested filter for the Python client should be configured as follows:
But after running into errors and looking into the code, this seems to be the valid syntax:
Or am I missing something?