Open SiddharthMurjani opened 1 month ago
Hey @SiddharthMurjani! I'm here to assist you with any bugs, questions, or contributions you might have. Let's tackle this together while we wait for the human maintainer.
Yes, the LlamaIndex retriever can filter or slice data based on the metadata stored in the metadata_ column of the embeddings table, but not implicitly: you set up metadata filters explicitly so that only documents matching specific metadata criteria are retrieved.
For example, to filter documents on the company metadata key, you can use the following code:
from llama_index.core.vector_stores import MetadataFilter, MetadataFilters, FilterOperator

# Keep only nodes whose "company" metadata equals the given value.
filters = MetadataFilters(
    filters=[
        MetadataFilter(key="company", operator=FilterOperator.EQ, value="YourCompanyName"),
    ]
)

# `index` is assumed to be an existing index built over the embeddings table.
retriever = index.as_retriever(filters=filters)
retriever.retrieve("Your query here")
This code sets up a filter that retrieves only documents whose company metadata equals "YourCompanyName" [1][2][3][4].
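To make the "explicit, not implicit" point concrete, here is a minimal pure-Python sketch of what an equality filter does before ranking. It is not LlamaIndex code: the `Node` class, the `retrieve` function, and the sample data are hypothetical stand-ins used only for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a stored row: text, embedding, metadata."""
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(nodes, query_embedding, company=None, top_k=2):
    # The equality filter is applied only when the caller asks for one.
    candidates = [
        n for n in nodes if company is None or n.metadata.get("company") == company
    ]
    # Rank the remaining candidates by similarity to the query.
    ranked = sorted(
        candidates, key=lambda n: cosine(n.embedding, query_embedding), reverse=True
    )
    return ranked[:top_k]


nodes = [
    Node("Acme annual report", [1.0, 0.0], {"company": "Acme"}),
    Node("Globex annual report", [0.9, 0.1], {"company": "Globex"}),
    Node("Acme press release", [0.0, 1.0], {"company": "Acme"}),
]

unfiltered = retrieve(nodes, [1.0, 0.0])
filtered = retrieve(nodes, [1.0, 0.0], company="Acme")
```

Without the filter, the two most similar nodes come from different companies; with `company="Acme"`, only Acme nodes are candidates, even though one of them is a weaker match.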
To continue talking to Dosu, mention @dosu.
Are there any performance implications if the embeddings table is large?
Yes, a large embeddings table has performance implications when using LlamaIndex for retrieval with metadata filtering. The filter is applied before the similarity search, so the database has to evaluate it over more rows, and both the filtering and the retrieval may need more time and computational resources as the table grows. How much this matters depends on the underlying database operations, for example whether the filtered field can be served by an index [1][2][3].
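One way to reason about the filtering cost is to check whether the database can answer the filter from an index instead of scanning every row. The sketch below uses SQLite from the Python standard library purely as a stand-in for PostgreSQL; the `docs` table, the `idx_docs_company` index, and the data are hypothetical, and the PostgreSQL planner will behave differently in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, company TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO docs (company, text) VALUES (?, ?)",
    [(f"company_{i % 50}", f"chunk {i}") for i in range(5000)],
)

query = "SELECT id FROM docs WHERE company = ?"


def plan(sql):
    # EXPLAIN QUERY PLAN returns rows whose last column describes each step.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("company_7",)).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_without_index = plan(query)  # full table scan
conn.execute("CREATE INDEX idx_docs_company ON docs (company)")
plan_with_index = plan(query)  # lookup through the new index

matches = conn.execute(query, ("company_7",)).fetchall()
print(plan_without_index)
print(plan_with_index)
```

The same check can be done in PostgreSQL with EXPLAIN on the filtered query. This sketch says nothing about the cost of the vector search itself; it only illustrates the filter part.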
Is there any way we can add a custom field to this table? Is there a way we can set up LlamaIndex to use this custom field while retrieving embeddings? Is there a way we can avoid using the metadata field for filtering?
class AbstractData(base):  # type: ignore
    __abstract__ = True  # this line is necessary
    id = Column(BIGINT, primary_key=True, autoincrement=True)
    text = Column(VARCHAR, nullable=False)
    metadata_ = Column(metadata_dtype)
    node_id = Column(VARCHAR)
    embedding = Column(Vector(embed_dim))  # type: ignore
To add a custom field to the AbstractData table, set up LlamaIndex to use this custom field while retrieving embeddings, and avoid using the metadata_ field for filtering, you can follow these steps:
Add the Custom Field to the AbstractData Table:
Modify the get_data_model function to include your custom field. For example, let's add a custom field named custom_field of type VARCHAR.
from typing import Any, Type


def get_data_model(
    base: Type,
    index_name: str,
    schema_name: str,
    hybrid_search: bool,
    text_search_config: str,
    cache_okay: bool,
    embed_dim: int = 1536,
    m: int = 16,
    ef_construction: int = 128,
    ef: int = 64,
) -> Any:
    from sqlalchemy import Column, Computed
    from sqlalchemy.dialects.postgresql import ARRAY, BIGINT, JSON, REAL, TSVECTOR, VARCHAR
    from sqlalchemy.schema import Index
    from sqlalchemy.types import TypeDecorator

    class TSVector(TypeDecorator):
        impl = TSVECTOR
        cache_ok = cache_okay

    tablename = "data_%s" % index_name  # dynamic table name
    class_name = "Data%s" % index_name  # dynamic class name
    indexname = "%s_idx" % index_name  # dynamic index name
    hnsw_indexname = "%s_hnsw_idx" % index_name  # dynamic hnsw index name

    if hybrid_search:

        class HybridAbstractData(base):  # type: ignore
            __abstract__ = True  # this line is necessary
            id = Column(BIGINT, primary_key=True, autoincrement=True)
            text = Column(VARCHAR, nullable=False)
            metadata_ = Column(JSON)
            node_id = Column(VARCHAR)
            embedding = Column(ARRAY(REAL, embed_dim))  # type: ignore
            text_search_tsv = Column(  # type: ignore
                TSVector(),
                Computed(
                    "to_tsvector('%s', text)" % text_search_config, persisted=True
                ),
            )
            custom_field = Column(VARCHAR)  # Add custom field here

        model = type(
            class_name,
            (HybridAbstractData,),
            {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
        )

        Index(
            indexname,
            model.text_search_tsv,  # type: ignore
            postgresql_using="gin",
        )
    else:

        class AbstractData(base):  # type: ignore
            __abstract__ = True  # this line is necessary
            id = Column(BIGINT, primary_key=True, autoincrement=True)
            text = Column(VARCHAR, nullable=False)
            metadata_ = Column(JSON)
            node_id = Column(VARCHAR)
            embedding = Column(ARRAY(REAL, embed_dim))  # type: ignore
            custom_field = Column(VARCHAR)  # Add custom field here

        model = type(
            class_name,
            (AbstractData,),
            {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
        )

    # HNSW index on the embedding column, built for whichever model was created.
    Index(
        hnsw_indexname,
        model.embedding,  # type: ignore
        postgresql_using="lantern_hnsw",
        postgresql_with={
            "m": m,
            "ef_construction": ef_construction,
            "ef": ef,
            "dim": embed_dim,
        },
        postgresql_ops={"embedding": "dist_cos_ops"},
    )
    return model
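Changing the model class affects tables created from it afterwards. If the embeddings table already exists and holds rows, it will most likely also need a schema change and a backfill. The following is a minimal sketch of that idea, using SQLite from the Python standard library as a stand-in for PostgreSQL; the `data_demo` table, the `company` metadata key, and the sample rows are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Existing table shape: the company lives only inside the JSON metadata.
conn.execute(
    "CREATE TABLE data_demo (id INTEGER PRIMARY KEY, text TEXT, metadata_ TEXT)"
)
conn.executemany(
    "INSERT INTO data_demo (text, metadata_) VALUES (?, ?)",
    [
        ("Acme report", json.dumps({"company": "Acme"})),
        ("Globex report", json.dumps({"company": "Globex"})),
        ("Untagged note", json.dumps({})),
    ],
)

# 1. Add the new column to the table that already exists.
conn.execute("ALTER TABLE data_demo ADD COLUMN custom_field TEXT")

# 2. Backfill it from the stored metadata; rows without the key stay NULL.
for row_id, raw in conn.execute("SELECT id, metadata_ FROM data_demo").fetchall():
    company = json.loads(raw).get("company")
    conn.execute(
        "UPDATE data_demo SET custom_field = ? WHERE id = ?", (company, row_id)
    )

backfilled = conn.execute(
    "SELECT text, custom_field FROM data_demo ORDER BY id"
).fetchall()
```

In a real deployment the same two steps would usually be run through a migration tool against the PostgreSQL table, rather than inline like this.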
Set Up LlamaIndex to Use the Custom Field:
Modify the _node_to_table_row method to include the custom field when adding nodes.
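def _node_to_table_row(self, node: BaseNode) -> Any:
    return self._table_class(
        node_id=node.node_id,
        embedding=node.get_embedding(),
        text=node.get_content(metadata_mode=MetadataMode.NONE),
        metadata_=node_to_metadata_dict(
            node,
            remove_text=True,
            flat_metadata=self.flat_metadata,
        ),
        # Include custom field here. BaseNode has no custom_field attribute,
        # so this assumes the value was stored in the node's metadata under
        # the (hypothetical) key "custom_field" at ingestion time.
        custom_field=node.metadata.get("custom_field"),
    )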
Avoid Using the metadata_ Field for Filtering:
Modify the _apply_filters_and_limit method to filter on the custom field instead of the metadata_ field.
def _apply_filters_and_limit(
    self,
    stmt: Select,
    limit: int,
    metadata_filters: Optional[MetadataFilters] = None,
) -> Any:
    import sqlalchemy

    if metadata_filters:
        for filter_ in metadata_filters.legacy_filters():
            # Note: every filter is compared against custom_field, whatever
            # its key is; the key is only used to name the bind parameter.
            bind_parameter = f"value_{filter_.key}"
            stmt = stmt.where(  # type: ignore
                sqlalchemy.text(f"custom_field = :{bind_parameter}")
            )
            stmt = stmt.params(  # type: ignore
                **{bind_parameter: str(filter_.value)}
            )
    return stmt.limit(limit)  # type: ignore
By following these steps, you can add a custom field to the AbstractData table, set up LlamaIndex to use this custom field while retrieving embeddings, and avoid using the metadata_ field for filtering [1][2][3][4].
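The filtering step above compares every filter against the single custom_field column. If more than one field is promoted to its own column, one possible variation is to map each filter key to a column through an allow-list and bind the values as parameters. This is a sketch under assumptions, not the library's implementation: `FILTERABLE_COLUMNS`, `build_where`, and the `data_demo` table are hypothetical, and SQLite stands in for PostgreSQL.

```python
import sqlite3

# Hypothetical allow-list: filter keys that have their own column.
FILTERABLE_COLUMNS = {"company", "doc_type"}


def build_where(filters):
    """Turn {"key": value} equality filters into a parameterized WHERE clause."""
    clauses, params = [], {}
    for key, value in filters.items():
        if key not in FILTERABLE_COLUMNS:
            raise ValueError(f"no dedicated column for filter key {key!r}")
        # The column name comes from the allow-list; the value is always bound.
        clauses.append(f"{key} = :{key}")
        params[key] = str(value)
    return " AND ".join(clauses) or "1 = 1", params


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data_demo "
    "(id INTEGER PRIMARY KEY, text TEXT, company TEXT, doc_type TEXT)"
)
conn.executemany(
    "INSERT INTO data_demo (text, company, doc_type) VALUES (?, ?, ?)",
    [
        ("Acme 10-K", "Acme", "filing"),
        ("Acme blog post", "Acme", "blog"),
        ("Globex 10-K", "Globex", "filing"),
    ],
)

where, params = build_where({"company": "Acme", "doc_type": "filing"})
rows = conn.execute(
    f"SELECT text FROM data_demo WHERE {where} ORDER BY id", params
).fetchall()
```

Rejecting unknown keys keeps column names out of user control, while the values still go through bind parameters as in the method above.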
Question Validation
Question
I am building a chatbot that does RAG over ingested documents that belong to different companies. I am storing these documents' embeddings in a single table. During retrieval, will the framework implicitly filter the results based on the company metadata (stored with the embedding in the metadata_ column)?