run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
35.22k stars 4.94k forks

[Question]: Query-time metadata filtering #14802

Open MatinKhajavi opened 1 month ago

MatinKhajavi commented 1 month ago

Question

Hi,

I am trying to implement a low-level Pinecone Retriever, and I want the ability to have a different set of filters for each query (since each query might have different metadata) during retrieval. When inheriting from the BaseRetriever class, we override the _retrieve function, but there is no way to pass these filters to retrieve.

From the documentation and discussions on Discord (as answered by @logan-markewich), the only solution I've found is to recreate the retrieval process for each query. I have managed to do this by adding a set_filters function and calling it each time before calling the retrieve method. (I guess it is either that or having a function that recreates the VectorStoreQuery and passes the filters through it.)

Am I missing something, or is this the only way to achieve this functionality?

Additionally, shouldn't QueryBundle have a filters attribute for cases like this?

I can implement this myself and submit a pull request if you think it is a good idea.

Here is my current code:

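from typing import Union, List, Optional, Dict

# Imports the snippet relies on (module paths follow the llama_index.core package layout)
from llama_index.core.base.embeddings.base import BaseEmbedding
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.vector_stores.pinecone import PineconeVectorStore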

FilterValueType = Union[str, float, bool, List[str]]

class PineconeRetriever(BaseRetriever):
    """
    A custom retriever that leverages a Pinecone vector store for querying vectors based on the similarity of their embeddings,
    incorporating metadata filters to refine search results according to specific criteria.

    :param vector_store: The Pinecone vector store used for storing and retrieving vectors.
    :param embed_model: The model used to generate embeddings.
    :param query_mode: Mode of the query, defaults to "default".
    :param similarity_top_k: Number of top similar items to retrieve, defaults to 10.
    """

    def __init__(
        self,
        vector_store: PineconeVectorStore,
        embed_model: Optional[BaseEmbedding] = None,
        query_mode: str = "default",
        similarity_top_k: int = 10
    ) -> None:
        """
        Initializes the PineconeRetriever with necessary components for executing a retrieval task.

        :param vector_store: The vector storage system where vectors are indexed.
        :param embed_model: The model used to generate embeddings.
        :param query_mode: The querying mode, defaults to 'default'.
        :param similarity_top_k: The number of top results to return, defaults to 10.
        """
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        self._filters = {}
        super().__init__()

    def set_filters(self, filters: Dict[str, FilterValueType]) -> None:
        """
        Sets the metadata filter for querying the vector store.

        :param filters: Metadata key-value pairs used to refine the search.
        """
        self._filters = filters

    def _retrieve(
        self,
        query_bundle: QueryBundle,
    ) -> List[NodeWithScore]:
        """
        Private method to execute the core retrieval logic. This method is called by the public `retrieve` method,
        facilitating the retrieval of the most relevant nodes based on the provided query bundle and optional metadata filters.

        This method should not be called directly; instead, use the `retrieve` method which ensures the proper handling
        of additional preprocessing and postprocessing steps if necessary.

        :param query_bundle: The query and potential embedding provided by the user.
        :return: A list of nodes with their associated scores based on the similarity of their vectors.
        """

        if query_bundle.embedding is None:
            if self._embed_model is None:
                raise ValueError("Embedding model is not available to generate query embeddings.")
            query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        else:
            query_embedding = query_bundle.embedding

        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )

        query_result = self._vector_store.query(vector_store_query, pinecone_query_filters=self._filters)
        nodes_with_scores = [
            NodeWithScore(node=node, score=query_result.similarities[i] if query_result.similarities else None)
            for i, node in enumerate(query_result.nodes)
        ]

        return nodes_with_scores

dosubot[bot] commented 1 month ago

Hey @MatinKhajavi! I'm a bot here to help you with bug fixes, questions, and becoming a contributor. I'm here to assist you while waiting for a human maintainer.

Your current approach of using a set_filters function to handle query-time metadata filtering in a Pinecone Retriever is indeed a standard and efficient method. This allows you to set different filters for each query dynamically.

The documentation does not mention a filters attribute for QueryBundle, and there is no indication that such an attribute is necessary. The current method using MetadataFilters and set_filters appears to be the best approach.

Here is an example of setting up and using metadata filters:

from llama_index.core.vector_stores import (
    MetadataFilter,
    MetadataFilters,
    FilterOperator,
    FilterCondition,
)

# Example of setting up a single filter
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="theme", operator=FilterOperator.EQ, value="Fiction"
        ),
    ]
)

# Example of setting up multiple filters with AND condition
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="theme", value="Fiction"),
        MetadataFilter(key="year", value=1997, operator=FilterOperator.GT),
    ],
    condition=FilterCondition.AND,
)

# Example of setting up multiple filters with OR condition
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="theme", value="Fiction"),
        MetadataFilter(key="year", value=1997, operator=FilterOperator.GT),
    ],
    condition=FilterCondition.OR,
)

# Using the filters with the retriever
# (`index` is assumed to be an existing VectorStoreIndex backed by the Pinecone vector store)
retriever = index.as_retriever(filters=filters)
retriever.retrieve("Harry Potter?")

Additionally, you can use keyword arguments specific to Pinecone:

retriever = index.as_retriever(
    vector_store_kwargs={"filter": {"theme": "Mafia"}}
)
retriever.retrieve("What is inception about?")

The PineconeVectorStore class supports converting standard metadata filters to Pinecone-specific filter dictionaries using the _to_pinecone_filter function, ensuring compatibility and efficiency during query time [1][2][3][4][5].
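
To make the conversion step concrete, here is a minimal sketch of how a list of key/operator/value filters could be turned into a Pinecone-style filter dictionary. It is not the library's `_to_pinecone_filter`; `SimpleFilter`, `OPERATOR_MAP`, and `to_pinecone_style_filter` are hypothetical names used only for illustration, and the real function's output may differ in detail. The `$eq`, `$gt`, `$and`, and `$or` keys follow Pinecone's metadata filter syntax.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class SimpleFilter:
    """Hypothetical stand-in for a single metadata filter."""
    key: str
    value: Any
    operator: str = "=="


# Comparison operators mapped to Pinecone-style operator keys.
OPERATOR_MAP = {
    "==": "$eq",
    "!=": "$ne",
    ">": "$gt",
    ">=": "$gte",
    "<": "$lt",
    "<=": "$lte",
    "in": "$in",
    "nin": "$nin",
}


def to_pinecone_style_filter(
    filters: List[SimpleFilter], condition: str = "and"
) -> Dict[str, Any]:
    """Build a Pinecone-style filter dict from simple filters."""
    if condition not in ("and", "or"):
        raise ValueError(f"Unsupported condition: {condition}")
    clauses = [{f.key: {OPERATOR_MAP[f.operator]: f.value}} for f in filters]
    if not clauses:
        return {}  # no filters: match everything
    if len(clauses) == 1:
        return clauses[0]  # a single clause needs no $and/$or wrapper
    return {"$" + condition: clauses}


example = to_pinecone_style_filter(
    [SimpleFilter("theme", "Fiction"), SimpleFilter("year", 1997, ">")]
)
```

In this sketch, `example` is a single `$and` clause wrapping one `$eq` test on `theme` and one `$gt` test on `year`, which is the shape a Pinecone query expects in its `filter` argument.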

Your implementation is correct, and there is no need to add a filters attribute to QueryBundle.
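
For comparison, the alternative raised in the question, carrying the filters on the query object instead of on the retriever, can be sketched with stand-in classes. Nothing here is LlamaIndex API: `FilteredQuery` and `InMemoryRetriever` are hypothetical, and the similarity search is left out so that only the filter handling is visible.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union

FilterValue = Union[str, float, bool, List[str]]
Record = Tuple[str, Dict[str, FilterValue]]


@dataclass
class FilteredQuery:
    """Hypothetical stand-in for a query bundle that also carries filters."""
    query_str: str
    filters: Dict[str, FilterValue] = field(default_factory=dict)


class InMemoryRetriever:
    """Stand-in retriever over (text, metadata) records."""

    def __init__(self, records: List[Record]) -> None:
        self._records = records

    def retrieve(self, query: FilteredQuery) -> List[str]:
        # Filters are read from the query, so the retriever keeps no
        # per-query state between calls. Similarity ranking is omitted:
        # query_str is ignored and only exact metadata matches are applied.
        return [
            text
            for text, metadata in self._records
            if all(metadata.get(k) == v for k, v in query.filters.items())
        ]


records: List[Record] = [
    ("Harry Potter", {"theme": "Fiction", "year": 1997}),
    ("The Godfather", {"theme": "Mafia", "year": 1969}),
]
retriever = InMemoryRetriever(records)
fiction = retriever.retrieve(FilteredQuery("wizards", {"theme": "Fiction"}))
everything = retriever.retrieve(FilteredQuery("anything"))
```

Compared with calling `set_filters` before each `retrieve`, this shape avoids leaving a stale filter on a shared retriever, at the cost of a query type that differs from the standard `QueryBundle`.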
