qdrant / qdrant-client

Python client for Qdrant vector search engine
https://qdrant.tech
Apache License 2.0

Docu: module 'qdrant_client.models' has no attribute 'NestedContainer' #196

Closed hopkins385 closed 1 year ago

hopkins385 commented 1 year ago

According to the docs, the nested filter for the Python client should be configured as follows:

client.scroll(
    collection_name="{collection_name}",
    scroll_filter=models.Filter(
        must=[
            models.NestedContainer(
                nested=models.NestedCondition(
                    key="diet",
                    filter=(
                        must=[
                            models.FieldCondition(
                                key="food",
                                match=models.MatchValue(value="meat")
                            ),
                            models.FieldCondition(
                                key="likes",
                                match=models.MatchValue(value=True)
                            ),
                        ]
                    )
                )
            )
        ],
    ),
)

But after running into errors and looking into the code, this appears to be the valid syntax:

   filter=models.Filter(
        must=[
            models.NestedCondition(
                nested=models.Nested(
                    key="diet",
                    filter=(
                        must=[
                            models.FieldCondition(
                                key="food",
                                match=models.MatchValue(value="meat")
                            ),
                            models.FieldCondition(
                                key="likes",
                                match=models.MatchValue(value=True)
                            ),
                        ]
                    )
                )
            )
        ],
    ),

Or am I missing something?

joein commented 1 year ago

Hi @hopkins385

Yes, you are right, it is a typo in the docs. Thanks for pointing it out! We will update it soon.

If you have any other problems with filters, feel free to reach out to us or take a look at the tests, e.g. the nested filter example.

joein commented 1 year ago

Also it should be filter=models.Filter(must=[models.FieldCondition...., and not just filter=(must=...)

hopkins385 commented 1 year ago

Cool, happy I was able to help. Maybe someone can briefly explain which approach would be more suitable for my use case. Scenario: a collection contains multiple "documents" (langchain), clustered by a custom metadata field "media_id". Problem: I want to restrict the search to a subset of the documents, say 2 out of 5, by passing an array of media_ids.

Which approach would be better? Side note: it is not clear to me whether the database first looks for "similar" vectors and then filters the result, or first applies the filter and then searches for "similar" vectors.

Approach 1:

def get_qdrant_documents(query: str, collection_name: str, media_ids: List[str] | None) -> List[Document]:

# ...

    filtr = models.Filter(
        must=[]
    )

    if media_ids is not None:
        filtr.must.append(
            models.FieldCondition(
                key="metadata.media_id",
                match=models.MatchAny(any=media_ids)
            )
        )

    return qdrant.similarity_search(
        query=query,
        filter=filtr,
    )

Approach 2:

def get_qdrant_documents(query: str, collection_name: str, media_ids: List[str] | None) -> List[Document]:

    filtr = models.Filter(
        must=[
            models.NestedCondition(
                nested=models.Nested(
                    key="metadata",
                    filter=models.Filter(
                        must=[]
                    )
                )
            )
        ]
    )

    if media_ids is not None:
        for media_id in media_ids:
            filtr.must[0].nested.filter.must.append(
                models.FieldCondition(
                    key="media_id",
                    match=models.MatchValue(value=media_id)
                )
            )

    return qdrant.similarity_search(
        query=query,
        filter=filtr,
    )

davidmyriel commented 1 year ago

Thank you for spotting this, we are updating the docs.

joein commented 1 year ago

Hello, @hopkins385

I guess the second filter is not what you want, since should has to be used instead of must: a single media_id cannot be equal to several different values at once. Apart from that, I think the two approaches are not really different in your case. An explanation of the difference between filters on dotted keys like metadata.media_id and those built with models.Nested can be found in the docs.

About the side note: it is a bit more complex than just pre-filtering or post-filtering. Qdrant has an advanced query planner which helps to perform queries efficiently. We have also applied certain modifications to the HNSW algorithm itself. This blog post can help you better understand Qdrant internals.
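To illustrate the must versus should point, here is a toy model in plain Python. It is not Qdrant's implementation, and the helper names (get_path, matches_all, matches_any) are hypothetical. It assumes each document stores a single scalar media_id, and only shows why AND-ing several equality checks on the same field returns nothing, while OR semantics (should, or a single MatchAny as in Approach 1) returns the intended subset.

```python
from typing import Any, Dict, List


def get_path(payload: Dict[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as "metadata.media_id" into nested dicts."""
    value: Any = payload
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches_all(payload: Dict[str, Any], key: str, wanted: List[str]) -> bool:
    """AND semantics ("must"): the field has to equal every wanted value."""
    return all(get_path(payload, key) == w for w in wanted)


def matches_any(payload: Dict[str, Any], key: str, wanted: List[str]) -> bool:
    """OR semantics ("should" / MatchAny): the field equals at least one value."""
    return any(get_path(payload, key) == w for w in wanted)


# Made-up sample payloads for illustration only.
docs = [
    {"metadata": {"media_id": "a"}},
    {"metadata": {"media_id": "b"}},
    {"metadata": {"media_id": "c"}},
]
wanted = ["a", "b"]

and_hits = [d["metadata"]["media_id"] for d in docs
            if matches_all(d, "metadata.media_id", wanted)]
or_hits = [d["metadata"]["media_id"] for d in docs
           if matches_any(d, "metadata.media_id", wanted)]

print(and_hits)  # []
print(or_hits)   # ['a', 'b']
```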