Open igaurab opened 1 day ago
To add new fields like `project_id` to the `PGVectorStore` schema in LlamaIndex, you can modify the dynamic SQLAlchemy model created in the `get_data_model` function. You would add a new column for `project_id` in the `HybridAbstractData` or `AbstractData` class, depending on whether hybrid search is enabled.

Here's an example of how you can add a `project_id` field:
```python
from sqlalchemy import Column, VARCHAR
# BIGINT, Computed, Vector, and TSVector are already imported/defined in
# llama_index/vector_stores/postgres/base.py, where get_data_model lives

# Inside the get_data_model function
if hybrid_search:

    class HybridAbstractData(base):  # type: ignore
        __abstract__ = True
        id = Column(BIGINT, primary_key=True, autoincrement=True)
        text = Column(VARCHAR, nullable=False)
        metadata_ = Column(metadata_dtype)
        node_id = Column(VARCHAR)
        project_id = Column(VARCHAR)  # Add this line for project_id
        embedding = Column(Vector(embed_dim))  # type: ignore
        text_search_tsv = Column(
            TSVector(),
            Computed(
                "to_tsvector('%s', text)" % text_search_config, persisted=True
            ),
        )

    model = type(
        class_name,
        (HybridAbstractData,),
        {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
    )
else:

    class AbstractData(base):  # type: ignore
        __abstract__ = True
        id = Column(BIGINT, primary_key=True, autoincrement=True)
        text = Column(VARCHAR, nullable=False)
        metadata_ = Column(metadata_dtype)
        node_id = Column(VARCHAR)
        project_id = Column(VARCHAR)  # Add this line for project_id
        embedding = Column(Vector(embed_dim))  # type: ignore

    model = type(
        class_name,
        (AbstractData,),
        {"__tablename__": tablename, "__table_args__": {"schema": schema_name}},
    )
```
This code snippet shows how to add a `project_id` column to the schema. You can adjust the column type and constraints as needed [1].

Regarding relationships with other tables, the repository does not provide specific guidance on handling foreign key constraints or defining relationships. You might need to handle these manually using SQLAlchemy's `relationship()` and `ForeignKey` features.
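If you do want a foreign key, a minimal sketch along those lines (assuming a hypothetical `projects` table and a mapped `Project` model, and mirroring the `get_data_model` snippet above where `base` is defined) might look like:

```python
from sqlalchemy import Column, ForeignKey, VARCHAR
from sqlalchemy.orm import declared_attr, relationship

class AbstractData(base):  # type: ignore
    __abstract__ = True
    # ... the existing id / text / metadata_ / node_id / embedding columns ...

    # Hypothetical: point project_id at a "projects" table you manage yourself
    project_id = Column(VARCHAR, ForeignKey("projects.id"), nullable=True)

    # Relationships on abstract bases are best declared lazily
    @declared_attr
    def project(cls):
        return relationship("Project")
```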
@dosu I see we can also pass the table name when creating an index
```python
vector_store = PGVectorStore.from_params(
    database=db_name,
    host=url.host,
    password=url.password,
    port=url.port,
    user=url.username,
    table_name=<table_name>,
    embed_dim=1536,  # openai embedding dimension
)
```
What columns should be present in my table so that I can pass this existing table and LlamaIndex will be able to use it by default? Can you give me the list of mandatory columns I must have so I can also do hybrid search?
@dosu is there a way I can give LlamaIndex a bunch of text and ask it to just generate the embeddings, node_id, text, etc. structure, so that I can manually insert this data wherever I want?
Yes, you can provide LlamaIndex with a set of text inputs and have it generate the embeddings, node_id, and text structure. Here's a simplified version of the steps:
1. Load Documents: Use a document loader to load your text data.

```python
from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="./data/your_document.pdf")
```

2. Parse into Nodes: Use a node parser to split the documents into nodes.

```python
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter(chunk_size=256)
nodes = node_parser.get_nodes_from_documents(documents)
```

3. Generate Embeddings: Use an embedding model to generate embeddings for each node.

```python
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()
for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content(metadata_mode="all")
    )
    node.embedding = node_embedding
```
This process will give you the embeddings, node_id, and text structure, which you can then manually insert elsewhere as needed [1].
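As a quick sanity check (a sketch, not part of the original steps), you can inspect what each node now carries before inserting it anywhere:

```python
# Each TextNode now has an id, chunk text, metadata, and an embedding vector
for node in nodes[:3]:
    print(node.node_id)               # e.g. a UUID string
    print(node.get_content()[:80])    # the chunk text
    print(node.metadata)              # file_path, file_name, ...
    print(len(node.embedding or []))  # embedding dimension, e.g. 1536 for OpenAI
```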
@dosu now if I want to insert this data into PostgreSQL without going through the PGVectorStore class, so that it is stored similarly to how PGVectorStore stores this data, how can I do so?
To insert data into PostgreSQL in a manner similar to how the `PGVectorStore` class stores it, you can use SQLAlchemy to define a table schema and insert the data. Here's a basic example of how you can achieve this:
```python
from sqlalchemy import create_engine, Column, BIGINT, VARCHAR
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import declarative_base, sessionmaker
from pgvector.sqlalchemy import Vector

Base = declarative_base()

class YourDataModel(Base):
    __tablename__ = 'your_table_name'
    id = Column(BIGINT, primary_key=True, autoincrement=True)
    text = Column(VARCHAR, nullable=False)
    metadata_ = Column(JSONB)
    node_id = Column(VARCHAR)
    embedding = Column(Vector(1536))  # Assuming 1536 is your embedding dimension

# Create an engine and session
engine = create_engine('postgresql://username:password@localhost/dbname')
Base.metadata.create_all(engine)  # Create the table if it does not exist yet
Session = sessionmaker(bind=engine)
session = Session()

# Assuming you have a list of nodes with text, node_id, embedding, and metadata
for node in nodes:
    data_entry = YourDataModel(
        text=node.get_content(),
        metadata_=node.metadata,
        node_id=node.node_id,
        embedding=node.embedding,
    )
    session.add(data_entry)

session.commit()
```
This code sets up a SQLAlchemy model with a schema similar to what `PGVectorStore` uses, and it demonstrates how to insert data into PostgreSQL using SQLAlchemy. Adjust the connection string and table name as needed for your setup.
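Note that this mirrors only the non-hybrid layout. If you also want the table to be usable for hybrid search, PGVectorStore's schema additionally has a generated tsvector column (`text_search_tsv`, shown earlier in this thread). A sketch of the equivalent column, reusing `Base` and the imports from the snippet above and assuming the `'english'` text search config used elsewhere in this thread:

```python
from sqlalchemy import Computed
from sqlalchemy.dialects.postgresql import TSVECTOR

class YourHybridDataModel(Base):
    __tablename__ = 'your_hybrid_table_name'  # hypothetical name
    id = Column(BIGINT, primary_key=True, autoincrement=True)
    text = Column(VARCHAR, nullable=False)
    metadata_ = Column(JSONB)
    node_id = Column(VARCHAR)
    embedding = Column(Vector(1536))
    # Generated column PGVectorStore uses for full-text (hybrid) search
    text_search_tsv = Column(
        TSVECTOR,
        Computed("to_tsvector('english', text)", persisted=True),
    )
```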
@dosu just doing `node.metadata` will not have all the metadata. For instance, this is the metadata that is stored in the database when using PGVectorStore:

```json
{"file_path": "/Users/igaurab/workspaces/xyz/xyzbot/notebooks/data/paul_graham/paul_graham_essay.txt", "file_name": "paul_graham_essay.txt", "file_type": "text/plain", "file_size": 75041, "creation_date": "2024-11-21", "last_modified_date": "2024-11-21", "_node_content": "{\"id_\": \"8f4209f9-4fe6-4fb2-be28-fd1db1298f96\", \"embedding\": null, \"metadata\": {\"file_path\": \"/Users/igaurab/workspaces/xyz/xyzbot/notebooks/data/paul_graham/paul_graham_essay.txt\", \"file_name\": \"paul_graham_essay.txt\", \"file_type\": \"text/plain\", \"file_size\": 75041, \"creation_date\": \"2024-11-21\", \"last_modified_date\": \"2024-11-21\"}, \"excluded_embed_metadata_keys\": [\"file_name\", \"file_type\", \"file_size\", \"creation_date\", \"last_modified_date\", \"last_accessed_date\"], \"excluded_llm_metadata_keys\": [\"file_name\", \"file_type\", \"file_size\", \"creation_date\", \"last_modified_date\", \"last_accessed_date\"], \"relationships\": {\"1\": {\"node_id\": \"2cf31c25-417e-4188-891e-305ce4d9268b\", \"node_type\": \"4\", \"metadata\": {\"file_path\": \"/Users/igaurab/workspaces/xyz/xyzbot/notebooks/data/paul_graham/paul_graham_essay.txt\", \"file_name\": \"paul_graham_essay.txt\", \"file_type\": \"text/plain\", \"file_size\": 75041, \"creation_date\": \"2024-11-21\", \"last_modified_date\": \"2024-11-21\"}, \"hash\": \"2bffa74f0f1af913b61642e8921b4012b3a81218bbdf3c0f3f500c307d67bbf1\", \"class_name\": \"RelatedNodeInfo\"}, \"3\": {\"node_id\": \"3003f8cb-a697-43ec-8dee-a4574984a630\", \"node_type\": \"1\", \"metadata\": {}, \"hash\": \"a86dcfe59f6bd34948c27f71ce3559236525c77d68fd0be377e2548e308f9ec0\", \"class_name\": \"RelatedNodeInfo\"}}, \"text\": \"\", \"mimetype\": \"text/plain\", \"start_char_idx\": 2, \"end_char_idx\": 4172, \"text_template\": \"{metadata_str}\\n\\n{content}\", \"metadata_template\": \"{key}: {value}\", \"metadata_seperator\": \"\\n\", \"class_name\": \"TextNode\"}", "_node_type": "TextNode", "document_id": "2cf31c25-417e-4188-891e-305ce4d9268b", "doc_id": "2cf31c25-417e-4188-891e-305ce4d9268b", "ref_doc_id": "2cf31c25-417e-4188-891e-305ce4d9268b"}
```

whereas this is what I get when I do `node.metadata`:
```python
{'file_path': '/Users/igaurab/workspaces/xyz/xyzbot/notebooks/data/paul_graham/paul_graham_essay.txt',
 'file_name': 'paul_graham_essay.txt',
 'file_type': 'text/plain',
 'file_size': 75041,
 'creation_date': '2024-11-21',
 'last_modified_date': '2024-11-21'}
```
To capture and store all metadata, including nested structures and additional fields, when using `PGVectorStore`, you should use the `node_to_metadata_dict` function. This function converts a `BaseNode` into a metadata dictionary, capturing the entire node content as a JSON string within the metadata. This approach allows you to store complex and nested metadata structures.

Here's how you can use the `node_to_metadata_dict` function:
```python
from llama_index.core.schema import BaseNode
from llama_index.core.vector_stores.utils import node_to_metadata_dict

def store_node_with_full_metadata(node: BaseNode):
    # Convert the node to a metadata dictionary
    metadata = node_to_metadata_dict(node, flat_metadata=False)
    # The metadata dictionary now contains the entire node content as a JSON string
    # You can store this metadata in your database
    # Example: db.store(metadata)
    return metadata
```
By setting `flat_metadata=False`, you avoid the validation that enforces a flat dictionary, allowing you to store nested structures within the `_node_content` key as a JSON string. This ensures that all metadata, including nested structures, is captured and stored [1].
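Tying this to the manual-insert snippet earlier in the thread, a sketch (assuming the `YourDataModel` table, `session`, and `nodes` defined above) that stores the full metadata instead of just `node.metadata` could look like:

```python
from llama_index.core.vector_stores.utils import node_to_metadata_dict

for node in nodes:
    data_entry = YourDataModel(
        text=node.get_content(),
        # Full metadata, including _node_content, similar to what PGVectorStore stores
        metadata_=node_to_metadata_dict(node, remove_text=True, flat_metadata=False),
        node_id=node.node_id,
        embedding=node.embedding,
    )
    session.add(data_entry)
session.commit()
```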
@dosu with some modifications I was able to manually insert the data into a table named `data_example` in a database named `test_db`. Now how can I use this table to query via the `PGVectorStore` class? One requirement is that I want to search only a subset of the documents inside `data_example`, those with a `chatbot_id` of `1234`. I need to pre-filter these documents.
To query the `data_example` table in the `test_db` database using the `PGVectorStore` class and limit the search to documents with a `chatbot_id` of `1234`, you can use `MetadataFilters` with an `ExactMatchFilter`. Here's how you can do it:
```python
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core.vector_stores.types import VectorStoreQuery, MetadataFilters, ExactMatchFilter

# Define connection parameters
PARAMS = {
    "host": "localhost",
    "user": "postgres",
    "password": "mark90",
    "port": 5432,
}

# Initialize the PGVectorStore
pg = PGVectorStore.from_params(
    **PARAMS,
    database="test_db",
    table_name="data_example",
    embed_dim=2,  # Adjust the embed_dim as needed
)

# Define the metadata filter to limit the search to documents with chatbot_id of 1234
filters = MetadataFilters(
    filters=[ExactMatchFilter(key="chatbot_id", value=1234)]
)

# Create a query with the specified filters
query = VectorStoreQuery(
    query_embedding=[0.1, 0.1],  # Example embedding vector
    similarity_top_k=10,
    filters=filters,
)

# Execute the query
result = pg.query(query)

# Process the result
for node in result.nodes:
    print(node.node_id, node.text)

# Close the connection
pg.close()
```
This code sets up a connection to the PostgreSQL database, applies a filter to only include documents with a `chatbot_id` of `1234`, and executes the query. Adjust the `query_embedding` and `embed_dim` as needed for your specific use case [1].
@dosu I was hoping I could use this interface:
```python
from sqlalchemy import make_url

url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database="xyz_test",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="data_example",
    embed_dim=1536,  # openai embedding dimension
    hybrid_search=True,
    text_search_config="english",
)

index = VectorStoreIndex.from_vector_store(vector_store=hybrid_vector_store)
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
```
Currently it returns an 'Empty Response'. How can I add filters to this query?
To add filters to your query using the `VectorStoreIndex` and `PGVectorStore`, you can use the `MetadataFilters` class to specify the filters you want to apply. Here's how you can modify your query to include filters:
```python
from llama_index.core.vector_stores.types import MetadataFilters, MetadataFilter, FilterOperator

# Define the metadata filter to limit the search to documents with chatbot_id of 1234
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="chatbot_id",
            value=["1234"],
            operator=FilterOperator.IN,
        )
    ]
)

# Build a query engine that applies the filters, then query it with plain text
query_engine = index.as_query_engine(similarity_top_k=10, filters=filters)
response = query_engine.query("What did the author do?")
```
In this example, the `MetadataFilters` object is created with a list of `MetadataFilter` objects. Each `MetadataFilter` specifies a key, a value, and an operator (e.g., `FilterOperator.IN`). The filters are passed to `as_query_engine`, so every query run through the `query_engine` is restricted to the matching documents, reducing the likelihood of an 'Empty Response' [1].
I got it to work. If you pass the table name as `example`, then LlamaIndex will automatically prepend it with `data_`. In my case my table name was `data_example`, so it was searching for `data_data_example`. I figured this out by setting the `debug=True` flag on the PGVectorStore; then I passed only `example` as my table name and it used the table `data_example`. Is there a way to turn this prepending of `data_` off? @dosu
There isn't a built-in configuration option or parameter in `PGVectorStore` to disable the automatic prepending of 'data_' to table names. The table name is modified in the `get_data_model` function within the `llama_index/vector_stores/postgres/base.py` file, where it is dynamically constructed by prefixing 'data_' to the `index_name` [1].
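For illustration, the naming behavior described above works out to something like the following (a sketch of the effective table names, not the library's exact code):

```python
index_name = "example"              # what you pass as table_name
tablename = "data_%s" % index_name  # roughly how get_data_model builds the name
print(tablename)                    # -> "data_example"
# Passing "data_example" would therefore resolve to "data_data_example"
```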
@dosu
```python
from sqlalchemy import make_url

url = make_url(connection_string)
hybrid_vector_store = PGVectorStore.from_params(
    database="xyz_test",
    host="localhost",
    password="password",
    port=5432,
    user="postgres",
    table_name="example",
    embed_dim=1536,
    hybrid_search=True,
    text_search_config="english",
)

index = VectorStoreIndex.from_vector_store(vector_store=hybrid_vector_store)
query_engine = index.as_query_engine(vector_store_query_mode="hybrid", sparse_top_k=10)
response = query_engine.query("who is paul graham?")
```
In the table `example` I have a column named `chatbot_id`, and I want to only query the documents that have `chatbot_id=1`. How do I do this? I tried the metadata filter, but it only seems to work on metadata stored by LlamaIndex.
To query documents with a specific `chatbot_id` using the `PGVectorStore` and `VectorStoreIndex`, you need to ensure that the `chatbot_id` is included in the metadata that LlamaIndex uses for filtering. If `chatbot_id` is not part of the metadata structure that LlamaIndex recognizes, you might need to adjust how the metadata is handled or ensure that `chatbot_id` is included in the metadata when documents are indexed.

Here's how you can apply a filter using `MetadataFilters`:
```python
from llama_index.core.vector_stores.types import MetadataFilters, MetadataFilter, FilterOperator

# Define the metadata filter to limit the search to documents with chatbot_id of 1
filters = MetadataFilters(
    filters=[
        MetadataFilter(
            key="chatbot_id",
            value=["1"],
            operator=FilterOperator.IN,
        )
    ]
)

# Apply the filters when building the query engine, then query with plain text
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    sparse_top_k=10,
    filters=filters,
)
response = query_engine.query("who is paul graham?")
```
Ensure that the `chatbot_id` is included in the metadata when documents are indexed. If it's not, you may need to modify the indexing process to include `chatbot_id` in the metadata that LlamaIndex uses. If `chatbot_id` is a separate column and not part of the metadata, you might need to adjust the way the data is stored or queried so that it is recognized as part of the metadata filtering process.
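For example, one way to make the filter above work (a sketch, assuming you control ingestion and use the `nodes` list built earlier in this thread rather than relying on an existing `chatbot_id` column) is to put `chatbot_id` into each node's metadata before the nodes are written, so it ends up inside the `metadata_` JSONB that `MetadataFilters` searches:

```python
# Stamp chatbot_id into each node's metadata before inserting/indexing,
# so it lands in the metadata_ JSONB column that MetadataFilters queries.
for node in nodes:
    node.metadata["chatbot_id"] = "1"

# Then index/insert as usual; the MetadataFilter(key="chatbot_id", ...) above will match.
```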
Question Validation
Question
Can we add an extra column in the case of PGVectorStore in LlamaIndex? I wanted to define relationships with other tables I have and also add fields like project_id so I can query the data for a particular project.