Closed: nprime496 closed this 1 day ago
Not a direct answer, but you can use Postgres SQL directly for tasks like this. For example, I have this utility function:
def delete_row_from_source(source: str, vector_name: str):
    try:
        import psycopg2  # make sure the driver is available before querying
    except ImportError:
        log.error("Couldn't import psycopg2 - please install via 'pip install psycopg2'")
        return
    # the source value is passed as a bound query parameter, which protects
    # against SQL injection; vector_name is interpolated into the statement,
    # so it must come from trusted code, never from user input
    sql_params = {"source_delete": source}
    sql = f"""
    DELETE FROM {vector_name}
    WHERE metadata->>'source' = %(source_delete)s
    """
    do_sql(sql, sql_params=sql_params, connection_env=lookup_connection_env(vector_name))
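For reference, this is the statement the helper ends up issuing (with a hypothetical table name); note that the source value travels as a bound parameter rather than being spliced into the string:

```python
# hypothetical table name; in the helper it is whatever vector_name holds
vector_name = "my_vectorstore"
sql = f"""
DELETE FROM {vector_name}
WHERE metadata->>'source' = %(source_delete)s
"""
sql_params = {"source_delete": "legal_text.txt"}
print(sql.strip())
print(sql_params)
```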
I also add the document name as a 'source' metadata column to make it easier to delete all of a document's rows, e.g.:
saver = AlloyDBDocumentSaver.create_sync(
engine=engine,
table_name=table_name,
metadata_columns=["source", "doc_id"]
)
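The shape of the per-chunk records this produces can be sketched with plain dicts (a naive fixed-size chunker standing in for a real text splitter; `make_chunks` and the `doc_id` scheme are illustrative, not part of the library):

```python
def make_chunks(text: str, source: str, size: int = 200):
    # naive fixed-size chunker; a real pipeline would use a text splitter
    return [
        {
            "page_content": text[i : i + size],
            # "source" and "doc_id" match the metadata_columns above, so the
            # saver stores them as real columns rather than nested JSON
            "metadata": {"source": source, "doc_id": f"{source}-{i // size}"},
        }
        for i in range(0, len(text), size)
    ]

chunks = make_chunks("some long legal text " * 50, "legal_text.txt")
print(len(chunks), chunks[0]["metadata"])
```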
Hope that helps.
Most vector stores only support deletion by ID, but we could open a feature request to improve the experience.
The quickest way to delete the documents for your use case is similar to Mark's answer: use the AlloyDBLoader to pull the documents and retrieve the ids to be deleted. For example:
loader = await AlloyDBLoader.create(
engine,
query=f"SELECT * FROM {TABLE_NAME} WHERE langchain_metadata->>'document_id' = 'legal_text.txt';"
)
docs = await loader.aload()
ids = [doc.metadata["langchain_id"] for doc in docs]  # row UUIDs (langchain_id is the default id column), not the shared document_id
print(ids)
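With the ids in hand, the removal itself is the store's standard delete call (`adelete(ids=...)` on LangChain vector stores). A stand-in store makes the flow runnable here; against AlloyDB you would call `await vector_store.adelete(ids=ids)` instead:

```python
import asyncio

class FakeVectorStore:
    # stand-in for the real AlloyDB store; only the delete path is modeled
    def __init__(self):
        self.rows = {"uuid-1": "chunk 1", "uuid-2": "chunk 2", "uuid-3": "other doc"}

    async def adelete(self, ids):
        for row_id in ids:
            self.rows.pop(row_id, None)
        return True

store = FakeVectorStore()
# the ids collected from the loader feed straight into the delete call
asyncio.run(store.adelete(ids=["uuid-1", "uuid-2"]))
print(sorted(store.rows))
```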
If you plan on using this long term, I would recommend building a vector store with document_id as a dedicated column rather than nested in the metadata.
from langchain_google_alloydb_pg import Column
# Set table name
TABLE_NAME = "vectorstore_custom"
# create a table with document_id column
await engine.ainit_vectorstore_table(
table_name=TABLE_NAME,
vector_size=768,
metadata_columns=[Column("document_id", "TEXT")],
)
# create a vector store that unnests document_id from the rest of the metadata
custom_store = await AlloyDBVectorStore.create(
engine=engine,
table_name=TABLE_NAME,
embedding_service=embedding,
metadata_columns=["document_id"],
)
# Use loader to grab the row ids for the document
loader = await AlloyDBLoader.create(
    engine,
    query=f"SELECT * FROM {TABLE_NAME} WHERE document_id = 'legal_text.txt';"
)
docs = await loader.aload()
ids = [doc.metadata["langchain_id"] for doc in docs]  # row UUIDs, not the shared document_id value
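With document_id promoted to a real column, the delete can also be a single SQL statement instead of a load-then-delete round trip. A hypothetical sketch, using the same parameter style as the psycopg2 example earlier in the thread:

```python
# hypothetical direct-SQL alternative: filter on the dedicated column
TABLE_NAME = "vectorstore_custom"
sql = f"DELETE FROM {TABLE_NAME} WHERE document_id = %(doc)s"
params = {"doc": "legal_text.txt"}
print(sql)
```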
Please let me know if you have questions.
I am following the LangChain documentation for using AlloyDB to store embeddings.
I want to upload multiple documents to the database and perform similarity searches on them. Since my documents are large, I chunk them into smaller parts. I can successfully upload them to the database using the following code (as shown in the documentation):
Now, I want to delete all text chunks related to a specific file. For example:
This is where I am stuck. The only code available in the documentation to delete entries is:
However, this code requires knowing the UUIDs for each text chunk, which are generated on the fly and cannot be inferred from the document name (see the code above), unless I am mistaken.
I understand that it is best practice to use random UUIDs for IDs, but my question is:
How can I retrieve all UUIDs for a specific document so I can delete its text chunks in AlloyDB using LangChain?
More generally, using your package, how do I search for docs with certain characteristics (e.g. a "source" field in the records)?
Note:
I found this issue on GitHub related to the same problem, but no clear solution is presented.