googleapis / langchain-google-alloydb-pg-python


How to Retrieve and Delete Document Chunks by Document Name in AlloyDB using LangChain? #159

Closed · nprime496 closed this 1 day ago

nprime496 commented 1 week ago

I am following the documentation of LangChain for using AlloyDB to store embeddings.

I want to upload multiple documents to the database and perform similarity searches on them. Since my documents are large, I chunk them into smaller parts. I can successfully upload them to the database using the following code (as shown in the documentation):

import uuid

my_document_name = "legal_text.txt"
# Chunking my document and storing the result in all_texts
all_texts = ["Apples and oranges", "Cars and airplanes", "Pineapple", "Train", "Banana"]
metadatas = [{"len": len(t), "document_id": my_document_name} for t in all_texts]

ids = [str(uuid.uuid4()) for _ in all_texts]  # IMPORTANT: random IDs, not derived from the document name

await store.aadd_texts(all_texts, metadatas=metadatas, ids=ids)

Now, I want to delete all text chunks related to a specific file. For example:

def delete_my_file(document_name):
    # code logic to delete the file chunks
    ...

This is where I am stuck. The only code available in the documentation to delete entries is:

await store.adelete([ids[1]])

However, this code requires knowing the UUID of each text chunk. The UUIDs are generated on the fly and, unless I am mistaken, cannot be inferred from the document name (see the code above).

I understand that it is best practice to use random UUIDs as IDs, but my question is:

How can I retrieve all UUIDs for a specific document so I can delete its text chunks in AlloyDB using LangChain?

More generally, using your package, how can I search for documents with certain characteristics (e.g. a "source" field in records)?

Note:

I found this issue on GitHub related to the same problem, but no clear solution is presented.

MarkEdmondson1234 commented 1 week ago

Not a direct answer, but you can use Postgres SQL directly for tasks like this. For example, I have this utility function:

def delete_row_from_source(source: str, vector_name: str):
    try:
        import psycopg2
    except ImportError:
        raise ImportError("Couldn't import psycopg2 - please install via 'pip install psycopg2'")

    # Pass the user input as a query parameter so psycopg2 escapes it,
    # protecting against SQL injection. The table name cannot be
    # parameterized, so vector_name must come from a trusted source.
    sql_params = {'source_delete': source}
    sql = f"""
        DELETE FROM {vector_name}
        WHERE metadata->>'source' = %(source_delete)s
    """

    do_sql(sql, sql_params=sql_params, connection_env=lookup_connection_env(vector_name))
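
For reference, a call for the question's setup might look like this (the table name here is hypothetical):

delete_row_from_source("legal_text.txt", "my_vector_table")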

I also add the document name as a 'source' metadata column to make it easier to delete all of its rows, e.g.:

    saver = AlloyDBDocumentSaver.create_sync(
        engine=engine,
        table_name=table_name,
        metadata_columns=["source", "doc_id"]
    )
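
For the 'source' column to be populated, each saved document needs matching keys in its metadata. A minimal sketch, assuming the saver's add_documents method and reusing the file name from the question:

from langchain_core.documents import Document

docs = [
    Document(
        page_content="Apples and oranges",
        metadata={"source": "legal_text.txt", "doc_id": "chunk-0"},
    ),
]
saver.add_documents(docs)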

Hope that helps.

averikitsch commented 1 week ago

Most vector stores only support deletion by ID, but we can consider a feature request to improve the experience.

The quickest way to delete the documents for your use case is similar to Mark's answer. You can use AlloyDBLoader to pull the documents and retrieve the IDs to be deleted. For example:

from langchain_google_alloydb_pg import AlloyDBLoader

loader = await AlloyDBLoader.create(
    engine,
    query=f"SELECT content, langchain_id FROM {TABLE_NAME} WHERE langchain_metadata->>'document_id' = 'legal_text.txt';",
    metadata_columns=["langchain_id"],
)
docs = await loader.aload()
# langchain_id is the table's default ID column; "document_id" in the
# metadata would only return the file name, not the row UUIDs.
ids = [str(doc.metadata["langchain_id"]) for doc in docs]
print(ids)
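
These are the row IDs the vector store's delete method expects:

await store.adelete(ids)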

If you plan on using this long term, I would recommend building a vector store with document_id as its own column rather than nested in the metadata.

from langchain_google_alloydb_pg import AlloyDBVectorStore, Column

# Set table name
TABLE_NAME = "vectorstore_custom"

# create a table with document_id column
await engine.ainit_vectorstore_table(
    table_name=TABLE_NAME,
    vector_size=768,  # must match the embedding model's output dimension
    metadata_columns=[Column("document_id", "TEXT")],
)

# create a vector store that unnests document_id from the rest of the metadata
custom_store = await AlloyDBVectorStore.create(
    engine=engine,
    table_name=TABLE_NAME,
    embedding_service=embedding,
    metadata_columns=["document_id"],
)
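
Chunks added through this store keep document_id in their metadata, and the store writes that key to the dedicated column. Reusing the variables from the question:

metadatas = [{"document_id": my_document_name} for _ in all_texts]
await custom_store.aadd_texts(all_texts, metadatas=metadatas, ids=ids)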

# Use loader to grab document id
loader = await AlloyDBLoader.create(
    engine,
    query=f"SELECT content, langchain_id FROM {TABLE_NAME} WHERE document_id = 'legal_text.txt';",
    metadata_columns=["langchain_id"],
)
docs = await loader.aload()
ids = [str(doc.metadata["langchain_id"]) for doc in docs]
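
The retrieved IDs can then be passed straight to the store's delete method:

await custom_store.adelete(ids)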

Please let me know if you have questions.