rnadigital / agentcloud

Agent Cloud is like having your own GPT builder with a bunch of extra goodies. The GUI features: 1) a RAG pipeline that can natively embed 260+ datasources, 2) conversational apps (like GPTs), 3) multi-agent process automation apps (CrewAI), 4) tools, and 5) teams + user permissions. Get started fast with Docker and our install.sh.
https://agentcloud.dev
GNU Affero General Public License v3.0

Update chunked rows #550

Closed by ragyabraham 1 month ago

ragyabraham commented 2 months ago

Is your feature request related to a problem? Please describe.

When a chunking strategy is applied to a synced row, a single row results in several vector points. These points are assigned random UUIDs as their point IDs. To update these points, we need to be able to query and delete the existing points so that we don't create duplicates.

Describe the solution you'd like

For every point associated with the original row, we will store the row's primary key in the index field. This field will be used to query and delete the existing points prior to the upsert operation, ensuring that points are updated rather than duplicated (see the sketch below).
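
A minimal sketch of that delete-then-upsert flow, using the Python Qdrant client purely for illustration (the actual vector-db-proxy is written in Rust); the collection name, the index payload key, and the embed() helper are placeholders, not the project's real identifiers:

```python
import uuid
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
collection = "datasource_collection"  # placeholder collection name

def upsert_chunked_row(primary_key: str, chunks: list[str], embed) -> None:
    # 1. Delete every existing point whose "index" payload field holds this row's
    #    primary key, so a re-synced row does not leave stale chunk points behind.
    client.delete(
        collection_name=collection,
        points_selector=models.FilterSelector(
            filter=models.Filter(
                must=[
                    models.FieldCondition(
                        key="index",
                        match=models.MatchValue(value=primary_key),
                    )
                ]
            )
        ),
    )
    # 2. Re-insert one point per chunk, each carrying the primary key in the
    #    "index" payload so the next sync can find and delete them again.
    client.upsert(
        collection_name=collection,
        points=[
            models.PointStruct(
                id=str(uuid.uuid4()),        # point IDs stay random UUIDs
                vector=embed(chunk),         # embed() is a placeholder embedding function
                payload={"index": primary_key, "page_content": chunk},
            )
            for chunk in chunks
        ],
    )
```

For larger collections, a keyword payload index on the index field (via client.create_payload_index) would keep the filtered delete cheap.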

Additional context

We will use SearchType::ChunkedRow in the vector-db-proxy app to indicate that this delete-then-upsert operation is required.
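
Purely as an illustration of the dispatch, a hypothetical Python analogue (the real enum lives in the Rust vector-db-proxy; only the ChunkedRow variant comes from this issue, everything else is assumed):

```python
from enum import Enum

class SearchType(Enum):
    # Hypothetical Python mirror of the Rust enum in vector-db-proxy;
    # only CHUNKED_ROW is taken from this issue, the other variant is assumed.
    ROW = "row"
    CHUNKED_ROW = "chunked_row"

def handle_row_sync(search_type: SearchType, primary_key: str) -> None:
    if search_type is SearchType.CHUNKED_ROW:
        # A chunked row fans out into several points, so existing points for this
        # primary key must be deleted before re-inserting (see the sketch above).
        print(f"delete points where payload.index == {primary_key!r}, then upsert new chunks")
    else:
        print(f"upsert the point for row {primary_key!r} directly")

handle_row_sync(SearchType.CHUNKED_ROW, "row-123")
```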

tomlynchRNA commented 2 months ago

@ragyabraham Seems like when I update a row to change the cursor field and one of the metadata fields (the primary key stays the same), it still doesn't delete the old points from Qdrant.

Datasource config: (screenshot)

I have these 6 rows in BigQuery: (screenshot)

I sync and get 6 points in Qdrant: (screenshot)

Then I update one of the rows: (screenshot)

Then I sync again and have 7 points in Qdrant, with both the old and the new version of the "#StayProductive" row: (screenshots)
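
As an aside, one way to confirm which row the extra point belongs to is to scroll the collection and count points per primary-key payload value; a rough sketch with the Python Qdrant client, where the collection name and the "index" payload key are assumptions:

```python
from collections import Counter
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
collection = "datasource_collection"  # placeholder name

# Scroll through all points and count how many share the same primary-key payload
# value; any count > 1 after a re-sync means the old points were not deleted.
counts: Counter = Counter()
offset = None
while True:
    points, offset = client.scroll(
        collection_name=collection,
        limit=100,
        offset=offset,
        with_payload=True,
        with_vectors=False,
    )
    for p in points:
        counts[p.payload.get("index")] += 1  # "index" is the assumed primary-key payload field
    if offset is None:
        break

duplicates = {k: v for k, v in counts.items() if v > 1}
print(duplicates)
```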

Also note: my chunkingConfig is set to max_characters: 10 and new_after_n_chars: 10, but the page_content is definitely longer than 10 characters. Not sure if that's on our side or on Unstructured's. I'm using the actual Unstructured cloud API with our key. (A quick local check of those parameters is sketched below.)
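
A minimal local check, assuming the open-source unstructured library applies the same max_characters / new_after_n_chars semantics as the hosted API (the sample text here is made up):

```python
from unstructured.chunking.title import chunk_by_title
from unstructured.documents.elements import NarrativeText

# A single element whose text is clearly longer than 10 characters.
elements = [NarrativeText("This page_content is definitely longer than ten characters.")]

# max_characters is the hard cap per chunk; new_after_n_chars is the soft cap
# at which a new chunk is started early.
chunks = chunk_by_title(elements, max_characters=10, new_after_n_chars=10)

for chunk in chunks:
    print(len(chunk.text), repr(chunk.text))  # each chunk should be at most 10 characters
```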