run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Does the LlamaIndex retriever implicitly filter/slice data based on metadata? #13972

Open · SiddharthMurjani opened this issue 1 month ago

SiddharthMurjani commented 1 month ago

Question

I am building a chatbot that does RAG over ingested documents belonging to different companies. I am storing all of these documents' embeddings in a single table. During retrieval, will the framework implicitly filter results based on the company metadata (stored in the metadata_ column alongside each embedding)?

dosubot[bot] commented 1 month ago

Hey @SiddharthMurjani! I'm here to assist you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for the human maintainer.

The LlamaIndex retriever does not filter implicitly based on the metadata_ column; by default it searches across all embeddings in the table. However, you can explicitly set up metadata filters to ensure that only documents matching specific metadata criteria are retrieved.

For example, to filter documents based on the company metadata, you can use the following code:

from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

# Only retrieve nodes whose "company" metadata equals "YourCompanyName".
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="company", operator=FilterOperator.EQ, value="YourCompanyName"),
    ]
)

retriever = index.as_retriever(filters=filters)
nodes = retriever.retrieve("Your query here")

This code sets up a filter to only retrieve documents where the company metadata is equal to "YourCompanyName" [1][2][3][4].
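
For a filter like this to match anything, the company value must be attached to each document's metadata at ingestion time so that it lands in the metadata_ column. A minimal sketch of that ingestion step (the company names and document texts here are invented for illustration):

from llama_index.core import Document, VectorStoreIndex

# Tag each document with its company when building it; the tag is
# persisted into the metadata_ column next to the embedding.
documents = [
    Document(text="Acme quarterly report ...", metadata={"company": "Acme"}),
    Document(text="Globex onboarding guide ...", metadata={"company": "Globex"}),
]

index = VectorStoreIndex.from_documents(documents)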

SiddharthMurjani commented 1 month ago

Are there any performance implications if the embeddings table is large?

dosubot[bot] commented 1 month ago

Yes, there are performance implications when the embeddings table is large. The database must apply the metadata filter in addition to the vector similarity search, and both operations grow more expensive with table size. Without supporting indexes, the metadata filter can degrade into a sequential scan, so large tables can noticeably slow retrieval [1][2][3].
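
If the vector store is backed by PostgreSQL, one common mitigation is a B-tree expression index on the metadata key you filter by, so the equality check does not have to scan every row. A minimal sketch, assuming the data_<index_name> table naming shown later in this thread (the table name data_your_index and the connection string are hypothetical):

import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/db")  # hypothetical DSN

with engine.begin() as conn:
    # Index the extracted company value so filters like
    # metadata_->>'company' = :value can use an index scan
    # instead of a sequential scan.
    conn.execute(sqlalchemy.text(
        "CREATE INDEX IF NOT EXISTS data_your_index_company_idx "
        "ON public.data_your_index ((metadata_->>'company'))"
    ))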

SiddharthMurjani commented 1 month ago

Is there any way we can add a custom field to this table? Is there a way we can set up LlamaIndex to use this custom field while retrieving embeddings? Is there a way we can avoid using the metadata_ field for filtering?

    class AbstractData(base):  # type: ignore
        __abstract__ = True  # this line is necessary
        id = Column(BIGINT, primary_key=True, autoincrement=True)
        text = Column(VARCHAR, nullable=False)
        metadata_ = Column(metadata_dtype)
        node_id = Column(VARCHAR)
        embedding = Column(Vector(embed_dim))  # type: ignore

dosubot[bot] commented 1 month ago

To add a custom field to the AbstractData table, set up LlamaIndex to use this custom field while retrieving embeddings, and avoid using the metadata_ field for filtering, you can follow these steps:

  1. Add the Custom Field to the AbstractData Table: Modify the get_data_model function to include your custom field. For example, let's add a custom field named custom_field of type VARCHAR.

    def get_data_model(
       base: Type,
       index_name: str,
       schema_name: str,
       hybrid_search: bool,
       text_search_config: str,
       cache_okay: bool,
       embed_dim: int = 1536,
       m: int = 16,
       ef_construction: int = 128,
       ef: int = 64,
    ) -> Any:
       from sqlalchemy import Column, Computed
       from sqlalchemy.dialects.postgresql import ARRAY, BIGINT, JSON, REAL, TSVECTOR, VARCHAR
       from sqlalchemy.schema import Index
       from sqlalchemy.types import TypeDecorator
    
       class TSVector(TypeDecorator):
           impl = TSVECTOR
           cache_ok = cache_okay
    
       tablename = "data_%s" % index_name  # dynamic table name
       class_name = "Data%s" % index_name  # dynamic class name
       indexname = "%s_idx" % index_name  # dynamic index name
       hnsw_indexname = "%s_hnsw_idx" % index_name  # dynamic hnsw index name
    
       if hybrid_search:
           class HybridAbstractData(base):  # type: ignore
               __abstract__ = True  # this line is necessary
               id = Column(BIGINT, primary_key=True, autoincrement=True)
               text = Column(VARCHAR, nullable=False)
               metadata_ = Column(JSON)
               node_id = Column(VARCHAR)
               embedding = Column(ARRAY(REAL, embed_dim))  # type: ignore
               text_search_tsv = Column(  # type: ignore
                   TSVector(),
                   Computed(
                       "to_tsvector('%s', text)" % text_search_config, persisted=True
                   ),
               )
               custom_field = Column(VARCHAR)  # Add custom field here
    
           model = type(
               class_name,
               (HybridAbstractData,),
               {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
           )
    
           Index(
               indexname,
               model.text_search_tsv,  # type: ignore
               postgresql_using="gin",
           )
       else:
           class AbstractData(base):  # type: ignore
               __abstract__ = True  # this line is necessary
               id = Column(BIGINT, primary_key=True, autoincrement=True)
               text = Column(VARCHAR, nullable=False)
               metadata_ = Column(JSON)
               node_id = Column(VARCHAR)
               embedding = Column(ARRAY(REAL, embed_dim))  # type: ignore
               custom_field = Column(VARCHAR)  # Add custom field here
    
           model = type(
               class_name,
               (AbstractData,),
               {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
           )
    
       Index(
           hnsw_indexname,
           model.embedding,  # type: ignore
           postgresql_using="lantern_hnsw",
           postgresql_with={
               "m": m,
               "ef_construction": ef_construction,
               "ef": ef,
               "dim": embed_dim,
           },
           postgresql_ops={"embedding": "dist_cos_ops"},
       )
       return model
  2. Set Up LlamaIndex to Use the Custom Field: Modify the _node_to_table_row method to include the custom field when adding nodes. Note that BaseNode has no custom_field attribute, so the value is read from the node's metadata.

    def _node_to_table_row(self, node: BaseNode) -> Any:
       return self._table_class(
           node_id=node.node_id,
           embedding=node.get_embedding(),
           text=node.get_content(metadata_mode=MetadataMode.NONE),
           metadata_=node_to_metadata_dict(
               node,
               remove_text=True,
               flat_metadata=self.flat_metadata,
           ),
           custom_field=node.metadata.get("custom_field"),  # read from the node's metadata; BaseNode has no custom_field attribute
       )
  3. Avoid Using the metadata_ Field for Filtering: Modify the _apply_filters_and_limit method to use the custom field for filtering instead of the metadata_ field.

    def _apply_filters_and_limit(
       self,
       stmt: Select,
       limit: int,
       metadata_filters: Optional[MetadataFilters] = None,
    ) -> Any:
       import sqlalchemy
    
       if metadata_filters:
           for filter_ in metadata_filters.legacy_filters():
               bind_parameter = f"value_{filter_.key}"
               # Every legacy filter is matched against the custom_field
               # column; filter_.key is only used to name the bind parameter.
               stmt = stmt.where(  # type: ignore
                   sqlalchemy.text(f"custom_field = :{bind_parameter}")
               )
               stmt = stmt.params(  # type: ignore
                   **{bind_parameter: str(filter_.value)}
               )
       return stmt.limit(limit)  # type: ignore

By following these steps, you can add a custom field to the AbstractData table, have LlamaIndex populate it when nodes are written, and filter on it at retrieval time instead of going through the metadata_ field [1][2][3][4].
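
One practical caveat: declaring the new column on the model does not change a table that already exists, so existing deployments need a small migration before steps 2 and 3 take effect. A minimal sketch (the table name data_your_index and the connection string are hypothetical, following the data_<index_name> convention in get_data_model above):

import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@localhost/db")  # hypothetical DSN

with engine.begin() as conn:
    # SQLAlchemy does not add new model columns to existing tables;
    # add the column in place before writing to or filtering on it.
    conn.execute(sqlalchemy.text(
        "ALTER TABLE public.data_your_index "
        "ADD COLUMN IF NOT EXISTS custom_field VARCHAR"
    ))

After the migration, retrieval uses the same MetadataFilters pattern as in the first answer; with the patched _apply_filters_and_limit, the filter value is matched against the custom_field column rather than the metadata_ JSON.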
