tensorchord / pgvecto.rs-py

PGVecto.rs Python library
Apache License 2.0

New feature [sdk]: insert with a bool option to ignore duplicates #13

Closed: SebG-js closed this issue 1 month ago

SebG-js commented 1 month ago

How can I avoid inserting duplicate records into the database?

Duplicates could concern every field (text, embedding, metadata dict) or only some fields, such as text and/or embedding.

Alternatives:

1) insert with a bool option to ignore exact duplicates

2) insert with a bool option to ignore exact duplicates, plus a list of keys that define a duplicate

3) an update method, to be defined?

cutecutecat commented 1 month ago

In PostgreSQL, the way to avoid inserting duplicate rows is to add a UNIQUE constraint at table creation. This is trivial for PostgreSQL native types such as meta (JSONB) and text (TEXT), but impossible for embedding (VECTOR): a UNIQUE constraint is only available through a btree index, which is not implemented for the PGVecto.rs VECTOR type.
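
For context, a minimal SQL-level sketch of that point (the table layout is illustrative, not the SDK's actual DDL, and it assumes a PostgreSQL instance with the pgvecto.rs extension installed and preloaded):

import psycopg

URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"  # placeholder

with psycopg.connect(URL) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vectors;")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS demo_collection (
            id        bigserial PRIMARY KEY,
            text      TEXT UNIQUE,   -- trivial: TEXT has btree support
            meta      JSONB UNIQUE,  -- trivial: JSONB has btree support
            embedding vector(3)      -- UNIQUE here would fail: no btree index for VECTOR
        );
        """
    )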

However, a vector is usually generated from a text by an embedding model, such as an OpenAI embedding model. The same text always produces the same vector, and it is almost impossible for two different texts to produce exactly the same vector. To a certain extent, text and vector are therefore in one-to-one correspondence.

Therefore, adding a UNIQUE constraint on text is effectively the same as adding a UNIQUE constraint on the vector.

In the next release, we will provide a way to add a UNIQUE constraint on the text column, the meta column, or any combination of them.

Here is how to use them:

# PGVectoRs comes from the SDK; Unique and Column ship with the constraint
# feature described above (exact import path assumed).
from pgvecto_rs.sdk import PGVectoRs, Unique, Column

URL = "postgresql+psycopg://postgres:mysecretpassword@localhost:5432/postgres"  # placeholder

# Constraint on text
client1 = PGVectoRs(
    db_url=URL,
    collection_name="client1",
    dimension=3,
    recreate=True,
    constraints=[Unique(columns=[Column.TEXT])],
)

# Constraint on meta
client2 = PGVectoRs(
    db_url=URL,
    collection_name="client2",
    dimension=3,
    recreate=True,
    constraints=[Unique(columns=[Column.META])],
)

# Constraint on text & meta (the pair as a whole must be unique)
client3 = PGVectoRs(
    db_url=URL,
    collection_name="client3",
    dimension=3,
    recreate=True,
    constraints=[Unique(columns=[Column.TEXT, Column.META])],
)

# Constraint on text | meta (each column must be unique on its own)
client4 = PGVectoRs(
    db_url=URL,
    collection_name="client4",
    dimension=3,
    recreate=True,
    constraints=[Unique(columns=[Column.TEXT]), Unique(columns=[Column.META])],
)
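
With a constraint in place, an insert that repeats the constrained column is rejected by PostgreSQL. A minimal sketch, assuming Record.from_text and insert behave as shown in the project README and that the SDK surfaces SQLAlchemy's IntegrityError:

from sqlalchemy.exc import IntegrityError
from pgvecto_rs.sdk import PGVectoRs, Record, Unique, Column  # Unique/Column path assumed

URL = "postgresql+psycopg://postgres:mysecretpassword@localhost:5432/postgres"  # placeholder

client = PGVectoRs(
    db_url=URL,
    collection_name="dedup_demo",
    dimension=3,
    recreate=True,
    constraints=[Unique(columns=[Column.TEXT])],
)

client.insert([Record.from_text("hello pgvecto.rs", [1.0, 2.0, 3.0], {"source": "a"})])

try:
    # Same text again: violates the UNIQUE constraint even though meta differs.
    client.insert([Record.from_text("hello pgvecto.rs", [1.0, 2.0, 3.0], {"source": "b"})])
except IntegrityError:
    print("duplicate text rejected")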

The feature PR is ready and should be posted within a week. Please let me know if you have any questions or comments.

SebG-js commented 1 month ago

Hello, thank you for your explanations. I understand the use case of the AND constraint (text + meta) very well, but I am not sure I understand the OR use case (constraint on text | meta). It could be tricky.

The method you propose will prevent inserting duplicates. Nice!

To modify an existing record, do we have to delete it and recreate it?

cutecutecat commented 1 month ago

I am not sure to understand the OR usecase (Constraint on text | meta).

Let me explain it in a simpler way. The OR constraint means a new row is rejected if (clause 1) its text repeats an existing text, (clause 2) its meta repeats an existing meta, or (clause 3) both repeat at once.

Clause 1 is covered by a constraint on text, clause 2 by a constraint on meta, and clause 3 is covered by either of them.

So an OR constraint can simply be translated into two separate constraints~
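
Concretely, under the text | meta client shown earlier, either kind of repetition is rejected. A sketch under the same assumptions as before (Record.from_text and insert per the project README, Unique and Column from the upcoming release):

from sqlalchemy.exc import IntegrityError
from pgvecto_rs.sdk import PGVectoRs, Record, Unique, Column  # Unique/Column path assumed

URL = "postgresql+psycopg://postgres:mysecretpassword@localhost:5432/postgres"  # placeholder

client = PGVectoRs(
    db_url=URL,
    collection_name="or_demo",
    dimension=3,
    recreate=True,
    # Two separate UNIQUE constraints give the OR behaviour discussed above.
    constraints=[Unique(columns=[Column.TEXT]), Unique(columns=[Column.META])],
)

client.insert([Record.from_text("alpha", [1.0, 0.0, 0.0], {"doc": 1})])

for text, meta in [
    ("alpha", {"doc": 2}),  # clause 1: text repeats
    ("beta", {"doc": 1}),   # clause 2: meta repeats
    ("alpha", {"doc": 1}),  # clause 3: both repeat
]:
    try:
        client.insert([Record.from_text(text, [0.0, 1.0, 0.0], meta)])
    except IntegrityError:
        print(f"rejected: text={text!r}, meta={meta}")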