run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Does the LlamaIndex retriever implicitly filter/slice data based on metadata? #13972

Open · SiddharthMurjani opened this issue 1 month ago

SiddharthMurjani commented 1 month ago

Question

I am building a chatbot that does RAG over ingested documents belonging to different companies. I am storing all of these documents' embeddings in a single table. During retrieval, will the framework implicitly filter results based on the company metadata (stored in the metadata_ column alongside each embedding)?

dosubot[bot] commented 1 month ago

Hey @SiddharthMurjani! I'm here to assist you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for the human maintainer.

The LlamaIndex retriever does not filter implicitly based on the metadata_ column; by default it searches across all embeddings in the table. However, you can explicitly set up metadata filters to ensure that only documents matching specific metadata criteria are retrieved.

For example, to filter documents based on the company metadata, you can use the following code:

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

# Only retrieve nodes whose "company" metadata equals "YourCompanyName".
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="company", operator=FilterOperator.EQ, value="YourCompanyName"),
    ]
)

retriever = index.as_retriever(filters=filters)
nodes = retriever.retrieve("Your query here")

This code sets up a filter to only retrieve documents where the company metadata is equal to "YourCompanyName" [1][2][3][4].
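
For a filter like this to match anything, the company value must be attached to each document's metadata at ingestion time so that it lands in the metadata_ column. A minimal sketch of that ingestion step (the company names and document texts here are invented for illustration):

from llama_index.core import Document, VectorStoreIndex

# Tag each document with its company when building it; the tag is
# persisted into the metadata_ column next to the embedding.
documents = [
    Document(text="Acme quarterly report ...", metadata={"company": "Acme"}),
    Document(text="Globex onboarding guide ...", metadata={"company": "Globex"}),
]

index = VectorStoreIndex.from_documents(documents)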

SiddharthMurjani commented 1 month ago

Are there any performance implications if the embeddings table is large?

dosubot[bot] commented 1 month ago

Yes, there are performance implications when the embeddings table is large. The database must apply the metadata filter in addition to the vector similarity search, and both operations grow more expensive with table size. Without supporting indexes, the metadata filter can degrade into a sequential scan, so large tables can noticeably slow retrieval [1][2][3].
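
If the vector store is backed by PostgreSQL, one common mitigation is a B-tree expression index on the metadata key you filter by, so the equality check does not have to scan every row. A minimal sketch, assuming the data_<index_name> table naming shown later in this thread (the table name data_your_index and the connection string are hypothetical):

import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/db")  # hypothetical DSN

with engine.begin() as conn:
    # Index the extracted company value so filters like
    # metadata_->>'company' = :value can use an index scan
    # instead of a sequential scan.
    conn.execute(sqlalchemy.text(
        "CREATE INDEX IF NOT EXISTS data_your_index_company_idx "
        "ON public.data_your_index ((metadata_->>'company'))"
    ))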

SiddharthMurjani commented 1 month ago

Is there any way we can add a custom field to this table? Is there a way we can set up LlamaIndex to use this custom field while retrieving embeddings? Is there a way we can avoid using the metadata_ field for filtering?

    class AbstractData(base):  # type: ignore
        __abstract__ = True  # this line is necessary
        id = Column(BIGINT, primary_key=True, autoincrement=True)
        text = Column(VARCHAR, nullable=False)
        metadata_ = Column(metadata_dtype)
        node_id = Column(VARCHAR)
        embedding = Column(Vector(embed_dim))  # type: ignore

dosubot[bot] commented 1 month ago

To add a custom field to the AbstractData table, set up LlamaIndex to use this custom field while retrieving embeddings, and avoid using the metadata_ field for filtering, you can follow these steps:

  1. Add the Custom Field to the AbstractData Table: Modify the get_data_model function to include your custom field. For example, let's add a custom field named custom_field of type VARCHAR.

    def get_data_model(
       base: Type,
       index_name: str,
       schema_name: str,
       hybrid_search: bool,
       text_search_config: str,
       cache_okay: bool,
       embed_dim: int = 1536,
       m: int = 16,
       ef_construction: int = 128,
       ef: int = 64,
    ) -> Any:
       from sqlalchemy import Column, Computed
       from sqlalchemy.dialects.postgresql import ARRAY, BIGINT, JSON, REAL, TSVECTOR, VARCHAR
       from sqlalchemy.schema import Index
       from sqlalchemy.types import TypeDecorator
    
       class TSVector(TypeDecorator):
           impl = TSVECTOR
           cache_ok = cache_okay
    
       tablename = "data_%s" % index_name  # dynamic table name
       class_name = "Data%s" % index_name  # dynamic class name
       indexname = "%s_idx" % index_name  # dynamic index name
       hnsw_indexname = "%s_hnsw_idx" % index_name  # dynamic hnsw index name
    
       if hybrid_search:
           class HybridAbstractData(base):  # type: ignore
               __abstract__ = True  # this line is necessary
               id = Column(BIGINT, primary_key=True, autoincrement=True)
               text = Column(VARCHAR, nullable=False)
               metadata_ = Column(JSON)
               node_id = Column(VARCHAR)
               embedding = Column(ARRAY(REAL, embed_dim))  # type: ignore
               text_search_tsv = Column(  # type: ignore
                   TSVector(),
                   Computed(
                       "to_tsvector('%s', text)" % text_search_config, persisted=True
                   ),
               )
               custom_field = Column(VARCHAR)  # Add custom field here
    
           model = type(
               class_name,
               (HybridAbstractData,),
               {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
           )
    
           Index(
               indexname,
               model.text_search_tsv,  # type: ignore
               postgresql_using="gin",
           )
       else:
           class AbstractData(base):  # type: ignore
               __abstract__ = True  # this line is necessary
               id = Column(BIGINT, primary_key=True, autoincrement=True)
               text = Column(VARCHAR, nullable=False)
               metadata_ = Column(JSON)
               node_id = Column(VARCHAR)
               embedding = Column(ARRAY(REAL, embed_dim))  # type: ignore
               custom_field = Column(VARCHAR)  # Add custom field here
    
           model = type(
               class_name,
               (AbstractData,),
               {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
           )
    
       Index(
           hnsw_indexname,
           model.embedding,  # type: ignore
           postgresql_using="lantern_hnsw",
           postgresql_with={
               "m": m,
               "ef_construction": ef_construction,
               "ef": ef,
               "dim": embed_dim,
           },
           postgresql_ops={"embedding": "dist_cos_ops"},
       )
       return model
  2. Set Up LlamaIndex to Use the Custom Field: Modify the _node_to_table_row method to include the custom field when adding nodes. Note that BaseNode has no custom_field attribute, so the value is read from the node's metadata.

    def _node_to_table_row(self, node: BaseNode) -> Any:
       return self._table_class(
           node_id=node.node_id,
           embedding=node.get_embedding(),
           text=node.get_content(metadata_mode=MetadataMode.NONE),
           metadata_=node_to_metadata_dict(
               node,
               remove_text=True,
               flat_metadata=self.flat_metadata,
           ),
           custom_field=node.metadata.get("custom_field"),  # read from the node's metadata; BaseNode has no custom_field attribute
       )
  3. Avoid Using the metadata_ Field for Filtering: Modify the _apply_filters_and_limit method to use the custom field for filtering instead of the metadata_ field.

    def _apply_filters_and_limit(
       self,
       stmt: Select,
       limit: int,
       metadata_filters: Optional[MetadataFilters] = None,
    ) -> Any:
       import sqlalchemy
    
       if metadata_filters:
           for filter_ in metadata_filters.legacy_filters():
               bind_parameter = f"value_{filter_.key}"
               # Every legacy filter is matched against the custom_field
               # column; filter_.key is only used to name the bind parameter.
               stmt = stmt.where(  # type: ignore
                   sqlalchemy.text(f"custom_field = :{bind_parameter}")
               )
               stmt = stmt.params(  # type: ignore
                   **{bind_parameter: str(filter_.value)}
               )
       return stmt.limit(limit)  # type: ignore

By following these steps, you can add a custom field to the AbstractData table, have LlamaIndex populate it when nodes are written, and filter on it at retrieval time instead of going through the metadata_ field [1][2][3][4].
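
One practical caveat: declaring the new column on the model does not change a table that already exists, so existing deployments need a small migration before steps 2 and 3 take effect. A minimal sketch (the table name data_your_index and the connection string are hypothetical, following the data_<index_name> convention in get_data_model above):

import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/db")  # hypothetical DSN

with engine.begin() as conn:
    # SQLAlchemy does not add new model columns to existing tables;
    # add the column in place before writing to or filtering on it.
    conn.execute(sqlalchemy.text(
        "ALTER TABLE public.data_your_index "
        "ADD COLUMN IF NOT EXISTS custom_field VARCHAR"
    ))

After the migration, retrieval uses the same MetadataFilters pattern as in the first answer; with the patched _apply_filters_and_limit, the filter value is matched against the custom_field column rather than the metadata_ JSON.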
